ScispaCy: Fast and Robust Models for Biomedical Natural Language Processing

@article{Neumann2019ScispaCyFA,
  title={ScispaCy: Fast and Robust Models for Biomedical Natural Language Processing},
  author={Mark Neumann and Daniel King and Iz Beltagy and Waleed Ammar},
  journal={ArXiv},
  year={2019},
  volume={abs/1902.07669}
}
Despite recent advances in natural language processing, many statistical models for processing text perform extremely poorly under domain shift. Processing biomedical and clinical text is a critically important application area of natural language processing, for which there are few robust, practical, publicly available models. This paper describes scispaCy, a new Python library and models for practical biomedical/scientific text processing, which heavily leverages the spaCy library. We detail…
SciBERT: A Pretrained Language Model for Scientific Text
TLDR
SciBERT leverages unsupervised pretraining on a large multi-domain corpus of scientific publications to improve performance on downstream scientific NLP tasks and demonstrates statistically significant improvements over BERT.
Biomedical and clinical English model packages for the Stanza Python NLP library
TLDR
The study introduces biomedical and clinical NLP packages built for the Stanza library, which offer performance similar to the state of the art and are also optimized for ease of use.
WTMED at MEDIQA 2019: A Hybrid Approach to Biomedical Natural Language Inference
TLDR
This paper proposes a hybrid approach to biomedical NLI where different types of information are exploited for this task, using a base model that includes a pre-trained text encoder as the core component, and a syntax encoder and a feature encoder to capture syntactic and domain-specific information.
Med7: a transferable clinical natural language processing model for electronic health records
TLDR
A named-entity recognition model for clinical natural language processing is introduced and the transferability of the developed model using the data from the Intensive Care Unit in the US to secondary care mental health records (CRIS) in the UK is evaluated.
Clinical Phrase Mining with Language Models
TLDR
Experimental results on the MIMIC-III dataset show that the proposed CliniPhrase method can outperform the current state-of-the-art techniques by up to 18% in terms of F1 measure while being very efficient (up to 48 times faster).
A Natural Language Processing System for Extracting Evidence of Drug Repurposing from Scientific Publications
TLDR
This emerging applications paper introduces a system to automate non-cancer generic drug evidence extraction from PubMed abstracts, comprising the following modules: querying, filtering, cancer type entity extraction, therapeutic association classification, and study type classification.
Investigating the Effect of Lexical Segmentation in Transformer-based Models on Medical Datasets
TLDR
This work investigates the effects of a specialised in-domain vocabulary trained from scratch on a biomedical corpus, and suggests that, although the in-domain vocabulary is useful, it is usually constrained by the corpus size because these models need to be trained from scratch.
Launching into clinical space with medspaCy: a new clinical text processing toolkit in Python
TLDR
MedspaCy, an extensible, open-source cNLP library based on the spaCy framework that allows flexible integration of rule-based and machine learning-based algorithms adapted to clinical text, is introduced.
YerevaNN’s Systems for WMT20 Biomedical Translation Task: The Effect of Fixing Misaligned Sentence Pairs
TLDR
YerevaNN’s neural machine translation systems and data processing pipelines developed for the WMT20 biomedical translation task are described, and most of the improvements are explained by the heavy data preprocessing pipeline, which attempts to fix poorly aligned sentences in the parallel data.
Neural Natural Language Processing for Unstructured Data in Electronic Health Records: a Review
TLDR
This survey paper summarizes current neural NLP methods for EHR applications, and focuses on a broad scope of tasks, namely, classification and prediction, word embeddings, extraction, generation, and other topics such as question answering, phenotyping, knowledge graphs, medical dialogue, multilinguality, interpretability, etc.

References

GENIA corpus - a semantically annotated corpus for bio-textmining
MOTIVATION: Natural language processing (NLP) methods are regarded as being useful to raise the potential of text mining from biological literature. The lack of an extensively annotated corpus of this…
Large-scale automated machine reading discovers new cancer-driving mechanisms
TLDR
Reaching, a system for automated, large-scale machine reading of biomedical papers that can extract mechanistic descriptions of biological processes with relatively high precision at high throughput, demonstrates that combining the extracted pathway fragments with existing biological data analysis algorithms helps identify and explain a large number of previously unidentified mutually exclusive altered signaling pathways in seven different cancer types.
A Simple Algorithm for Identifying Abbreviation Definitions in Biomedical Text
TLDR
This paper shows that the problem of identifying abbreviations' definitions can be solved with a much simpler algorithm than that proposed by other research efforts, and achieves 96% precision and 82% recall on a standard test collection, which is at least as good as existing approaches.
Adapting a Lexicalized-Grammar Parser to Contrasting Domains
TLDR
It is demonstrated that a CCG parser can be adapted to two new domains, biomedical text and questions for a QA system, by using manually-annotated training data at the POS and lexical category levels only, which achieves parser accuracy comparable to that on newspaper data without the need for annotated parse trees in the new domain.
Effective Use of Bidirectional Language Modeling for Transfer Learning in Biomedical Named Entity Recognition
TLDR
This work trains a bidirectional language model (BiLM) on unlabeled data and transfers its weights to "pretrain" an NER model with the same architecture as the BiLM, which results in a better parameter initialization of the NER models.
NCBI disease corpus: A resource for disease name recognition and concept normalization
TLDR
The results show that the NCBI disease corpus has the potential to significantly improve the state-of-the-art in disease name recognition and normalization research, by providing a high-quality gold standard thus enabling the development of machine-learning based approaches for such tasks.
From POS tagging to dependency parsing for biomedical event extraction
TLDR
A detailed empirical study comparing traditional feature-based and neural network-based models for POS tagging and dependency parsing in the biomedical context is presented, and the influence of parser selection for a biomedical event extraction downstream task is investigated.
Automatically Adapting an NLP Core Engine to the Biology Domain
Background: Rather than specifying rules, constraints and lexicons for NLP systems manually, we advocate a procedure for automatically acquiring linguistic knowledge using machine learning (ML)…
CHEMDNER: The drugs and chemical names extraction challenge
TLDR
This task allowed a comparative assessment of the performance of various methodologies using a carefully prepared collection of manually labeled text prepared by specially trained chemists as Gold Standard data, and it is expected that the tools and resources resulting from this effort will have an impact on future developments of chemical text mining applications.
LINNAEUS: A species name identification system for biomedical literature
TLDR
LINNAEUS is an open source, stand-alone software system capable of recognizing and normalizing species name mentions with speed and accuracy, and can be integrated into a range of bioinformatics and text-mining applications.