Generating a Distilled N-Gram Set - Effective Lexical Multiword Building in the SPECIALIST Lexicon

C. J. Lu, Destinee L. Tormey, Lynn McCreedy, A. Browne
Multiwords are vital to better Natural Language Processing (NLP) systems: they support more effective and efficient parsers, refine information retrieval searches, and enhance precision and recall in Medical Language Processing (MLP) applications. The Lexical Systems Group has enhanced the coverage of multiwords in the Lexicon to provide a more comprehensive resource for such applications. This paper describes a new systematic approach to lexical multiword acquisition from MEDLINE through filters…
The Unified Medical Language System SPECIALIST Lexicon and Lexical Tools: Development and applications
The objective is to provide generic, broad coverage and a robust lexical system for NLP applications. A novel multiword approach and other planned developments are proposed.
Enhanced LexSynonym Acquisition for Effective UMLS Concept Mapping
The LSG has developed a new system for element synonym acquisition, based on enhanced requirements and design for better performance. The results show a 36.71-fold growth of synonyms in the Lexicon (lexSynonym) in the 2017 release.
Enhanced Features in the SPECIALIST Lexicon - Antonyms
The objective is to develop a systematic approach to generating antonyms in the SPECIALIST Lexicon (hereafter, the Lexicon) and to provide the generic and comprehensive antonym features needed by the NLP community.
Spell checker for consumer language (CSpell)
CSpell improves over the state of the art and provides near real-time automatic misspelling detection and correction in consumer health questions.
Improving Spelling Correction with Consumer Health Terminology
The Consumer Health Information and Question Answering (CHIQA) project was launched to help consumers find reliable health information, and a systematic approach was developed to retrieve consumer health terminology (CHT) from the UMLS Metathesaurus and MEDLINE.
DNS anti-attack machine learning model for DGA domain name detection
Experimental results show that the approach based on Machine Learning can effectively identify DGA domain names.


Generating the MEDLINE N-Gram Set
This work processed 2.6 billion single words from 22.4 million MEDLINE documents (titles and abstracts) to generate MEDLINE n-grams (n = 1 to 5) for the 2014 release, keeping only terms that appear at least 30 times and have fewer than 50 characters, to resolve the Java limitation issue.
Using Element Words to Generate (Multi)words for the SPECIALIST Lexicon
A new systematic approach identifies single words and multiwords from MEDLINE through the use of element words and shows an accelerated growth of the Lexicon, particularly an increase in multiword records.
Towards Best Practice for Multiword Expressions in Computational Lexicons
The goal is to define a set of minimal lexicon “objects”, which can serve not only as a model for MWEs but also for lexical data in general, and establish uniform standards for describing multi-word lexical entries.
A Local Maxima method and a Fair Dispersion Normalization for extracting multi-word units from corpora
The availability of multi-word units (MWUs) in NLP lexica has important applications: it enhances parsing precision, helps with attachment decisions, and enables more natural interaction of non-specialists…
Multiword Expressions Acquisition: A Generic and Open Framework
This is the first book to cover the whole pipeline of multiword expression acquisition in a single volume. It contains solid experimental results and evaluates the mwetoolkit, demonstrating its usefulness for computer-assisted lexicography and machine translation.
Parsing Models for Identifying Multiword Expressions
This work develops two structured prediction models for joint parsing and multiword expression identification that can identify multiword expressions with much higher accuracy than a state-of-the-art system based on word co-occurrence statistics.
Multi-Word Expression Identification Using Sentence Surface Features
It is shown that simple rule-based baselines do not perform identification satisfactorily, and a supervised learning method for identification that uses sentence surface features based on expressions' canonical form is presented.
Identification of Multi-word Expressions by Combining Multiple Linguistic Information Sources
This work defines various linguistically motivated classification features and introduces novel ways of computing them. It manually defines interrelationships among the features and expresses them in a Bayesian network, resulting in a powerful classifier that can identify multiword expressions of various types and multiple syntactic constructions in text corpora.
A Systematic Approach for Automatically Generating Derivational Variants in Lexical Tools Based on the SPECIALIST Lexicon
The demand for natural language processing (NLP) in medicine has grown significantly in recent years. This growth is expected to increase rapidly due to the continuing adoption of…
Multilingual collocation extraction with a syntactic parser
Parsing, even if imperfect, leads to a significant improvement in the quality of results, in terms of collocational precision, MWE precision, and grammatical precision, which bears a high importance in the perspective of the subsequent integration of extraction results in other NLP applications.
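
The thresholds quoted in the "Generating the MEDLINE N-Gram Set" entry above (n = 1 to 5, at least 30 occurrences, under 50 characters) can be illustrated with a minimal sketch. This is not the Lexical Systems Group's implementation: the function names `ngram_counts` and `filter_ngrams`, the whitespace tokenization, the lowercasing, and the choice to count every occurrence (rather than documents) are all assumptions made for illustration.

```python
from collections import Counter


def ngram_counts(docs, max_n=5):
    """Count word n-grams (n = 1..max_n) over an iterable of text strings.

    Assumption: tokens are lowercased and split on whitespace; the real
    MEDLINE pipeline may tokenize and normalize differently.
    """
    counts = Counter()
    for doc in docs:
        tokens = doc.lower().split()
        for n in range(1, max_n + 1):
            # Slide a window of n tokens across the document.
            for i in range(len(tokens) - n + 1):
                counts[" ".join(tokens[i:i + n])] += 1
    return counts


def filter_ngrams(counts, min_count=30, max_chars=50):
    """Keep n-grams seen at least min_count times and shorter than max_chars."""
    return {
        term: count
        for term, count in counts.items()
        if count >= min_count and len(term) < max_chars
    }


if __name__ == "__main__":
    # Toy corpus: one phrase repeated often enough to pass, one that is not.
    docs = ["heart attack risk"] * 30 + ["rare term"]
    kept = filter_ngrams(ngram_counts(docs))
    for term in sorted(kept):
        print(term, kept[term])
```

Counting with an in-memory `Counter` is only practical for small corpora; at the scale described in that entry, a disk-backed or distributed count would be needed.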