ScispaCy: Fast and Robust Models for Biomedical Natural Language Processing

Mark Neumann, Daniel King, Iz Beltagy, Waleed Ammar

Despite recent advances in natural language processing, many statistical models for processing text perform extremely poorly under domain shift. Processing biomedical and clinical text is a critically important application area of natural language processing, for which there are few robust, practical, publicly available models. This paper describes scispaCy, a new Python library and models for practical biomedical/scientific text processing, which heavily leverages the spaCy library. We detail… 

Biomedical and clinical English model packages for the Stanza Python NLP library

The study introduces biomedical and clinical NLP packages built for the Stanza library, which offer performance similar to the state of the art and are also optimized for ease of use.

Fine-Tuning Large Neural Language Models for Biomedical Natural Language Processing

It is shown that fine-tuning performance may be sensitive to pretraining settings, especially in low-resource domains, and that domain-specific vocabulary and pretraining facilitate more robust models for fine-tuning.

SciBERT: A Pretrained Language Model for Scientific Text

SciBERT leverages unsupervised pretraining on a large multi-domain corpus of scientific publications to improve performance on downstream scientific NLP tasks and demonstrates statistically significant improvements over BERT.

English dictionaries, gold and silver standard corpora for biomedical natural language processing related to SARS-CoV-2 and COVID-19

This toolbox is freely available on GitHub and can be used for text analytics in a variety of settings related to the COVID-19 crisis. It will be expanded and applied to NLP tasks over the next weeks, and the community is invited to contribute.

Publicly Available Clinical BERT Embeddings

This work explores and releases two BERT models for clinical text: one for generic clinical text and another for discharge summaries specifically, and demonstrates that using a domain-specific model yields performance improvements on 3/5 clinical NLP tasks, establishing a new state-of-the-art on the MedNLI dataset.

WTMED at MEDIQA 2019: A Hybrid Approach to Biomedical Natural Language Inference

This paper proposes a hybrid approach to biomedical NLI where different types of information are exploited for this task, using a base model that includes a pre-trained text encoder as the core component, and a syntax encoder and a feature encoder to capture syntactic and domain-specific information.

Clinical Phrase Mining with Language Models

Experimental results on the MIMIC-III dataset show that the proposed CliniPhrase method can outperform the current state-of-the-art techniques by up to 18% in terms of F1 measure while being very efficient (up to 48 times faster).

The Treasury Chest of Text Mining: Piling Available Resources for Powerful Biomedical Text Mining

This review gathers the leading tools for biomedical text mining (TM), briefly describing and systematizing them, and surveys several resources to compile the most valuable ones for each category.

A Natural Language Processing System for Extracting Evidence of Drug Repurposing from Scientific Publications

This emerging applications paper introduces a system to automate non-cancer generic drug evidence extraction from PubMed abstracts, comprising the following modules: querying, filtering, cancer type entity extraction, therapeutic association classification, and study type classification.



GENIA corpus - a semantically annotated corpus for bio-textmining

MOTIVATION: Natural language processing (NLP) methods are regarded as being useful to raise the potential of text mining from biological literature. The lack of an extensively annotated corpus of this…

Large-scale automated machine reading discovers new cancer-driving mechanisms

Reach, a system for automated, large-scale machine reading of biomedical papers, extracts mechanistic descriptions of biological processes with relatively high precision at high throughput. Combining the extracted pathway fragments with existing biological data analysis algorithms helps identify and explain a large number of previously unidentified mutually exclusive altered signaling pathways in seven different cancer types.

A Simple Algorithm for Identifying Abbreviation Definitions in Biomedical Text

This paper shows that the problem of identifying abbreviations' definitions can be solved with a much simpler algorithm than that proposed by other research efforts, and achieves 96% precision and 82% recall on a standard test collection, which is at least as good as existing approaches.

Adapting a Lexicalized-Grammar Parser to Contrasting Domains

It is demonstrated that a CCG parser can be adapted to two new domains, biomedical text and questions for a QA system, by using manually annotated training data at the POS and lexical category levels only, which achieves parser accuracy comparable to that on newspaper data without the need for annotated parse trees in the new domain.

Effective Use of Bidirectional Language Modeling for Transfer Learning in Biomedical Named Entity Recognition

This work trains a bidirectional language model (BiLM) on unlabeled data and transfers its weights to "pretrain" an NER model with the same architecture as the BiLM, which results in a better parameter initialization of the NER models.

From POS tagging to dependency parsing for biomedical event extraction

A detailed empirical study comparing traditional feature-based and neural network-based models for POS tagging and dependency parsing in the biomedical context is presented, and the influence of parser selection for a biomedical event extraction downstream task is investigated.

Automatically Adapting an NLP Core Engine to the Biology Domain

In the first evaluation ever of an ML-based ensemble of core NLP components in the biology domain, it is demonstrated that the performance of OpenNLP’s sentence splitter, tokenizer, part-of-speech tagger, chunker and parser matches up with state-of-the-art performance figures from the newspaper domain.

LINNAEUS: A species name identification system for biomedical literature

LINNAEUS is an open source, stand-alone software system capable of recognizing and normalizing species name mentions with speed and accuracy, and can be integrated into a range of bioinformatics and text-mining applications.

Developing a Robust Part-of-Speech Tagger for Biomedical Text

Experimental results on the Wall Street Journal corpus, the GENIA corpus, and the PennBioIE corpus revealed that adding training data from a different domain does not hurt the performance of a tagger, and the authors' tagger exhibits very good precision on all these corpora.