SciBERT: A Pretrained Language Model for Scientific Text

Authors: Iz Beltagy, Kyle Lo, Arman Cohan
Published in: Conference on Empirical Methods in Natural Language Processing

Obtaining large-scale annotated data for NLP tasks in the scientific domain is challenging and expensive. We release SciBERT, a pretrained language model based on BERT (Devlin et al., 2018) to address the lack of high-quality, large-scale labeled scientific data. SciBERT leverages unsupervised pretraining on a large multi-domain corpus of scientific publications to improve performance on downstream scientific NLP tasks. We evaluate on a suite of tasks including sequence tagging, sentence…

NLP From Scratch Without Large-Scale Pretraining: A Simple and Efficient Framework

This work proposes TLM, a simple and efficient learning framework that does not rely on large-scale pretraining: it uses task data as queries to retrieve a tiny subset of the general corpus and jointly optimizes the task objective and the language modeling objective from scratch.
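
As a rough illustration of the idea in the summary above, the following minimal sketch uses task examples as queries to pick matching documents from a general corpus and then forms a joint loss. The token-overlap scoring, the function names `retrieve_subset` and `joint_loss`, the weight `lm_weight`, and the toy data are all assumptions made for illustration; they are not taken from the TLM paper.

```python
from collections import Counter


def overlap_score(query, document):
    """Count tokens shared by a query and a document (with multiplicity)."""
    q = Counter(query.lower().split())
    d = Counter(document.lower().split())
    return sum((q & d).values())


def retrieve_subset(task_examples, general_corpus, top_k=1):
    """Return sorted indices of corpus documents retrieved by any task example.

    Each task example acts as a query; its top_k highest-scoring documents
    are added to the retrieved subset.
    """
    selected = set()
    for query in task_examples:
        ranked = sorted(
            range(len(general_corpus)),
            key=lambda i: overlap_score(query, general_corpus[i]),
            reverse=True,
        )
        selected.update(ranked[:top_k])
    return sorted(selected)


def joint_loss(task_loss, lm_loss, lm_weight=1.0):
    """Combine the task objective and the language modeling objective.

    The simple weighted sum is an assumption, not the paper's exact formula.
    """
    return task_loss + lm_weight * lm_loss


# Toy data (hypothetical).
task_examples = ["protein binding assay", "gene expression in mice"]
general_corpus = [
    "stock prices fell sharply on Monday",
    "the protein binding assay showed strong affinity",
    "gene expression was measured in mice and rats",
    "the football season starts next week",
]

subset = retrieve_subset(task_examples, general_corpus, top_k=1)
print(subset)
print(joint_loss(task_loss=0.7, lm_loss=2.0, lm_weight=0.5))
```

In a real system the scalar losses would come from a model trained on the task data and the retrieved subset; here they are plain numbers so that the sketch runs on its own.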

Low-Resource Adaptation of Neural NLP Models

This thesis develops and adapts neural NLP models to explore a number of research questions concerning NLP tasks with minimal or no training data, and investigates methods for dealing with low-resource scenarios in information extraction and natural language understanding.

Multi-Stage Pretraining for Low-Resource Domain Adaptation

Transfer learning techniques are particularly useful in NLP tasks where a sizable amount of high-quality annotated data is difficult to obtain. Current approaches directly adapt a pre-trained

FinBERT: A Pretrained Language Model for Financial Communications

This work addresses the need by pretraining a financial-domain-specific BERT model, FinBERT, on large-scale financial communication corpora, and confirms the advantage of FinBERT over generic-domain BERT models.

Rethinking Relational Encoding in Language Model: Pre-Training for General Sequences

It is posited that while LMPT can effectively model per-token relations, it fails at modeling per-sequence relations in non-natural language domains, and a framework is developed that couples LMPT with deep structure-preserving metric learning to produce richer embeddings than can be obtained from LMPT alone.


The results show that extra information can improve the identification of cited text spans and the end-to-end trained models outperform models trained with two stages, and the averaged prediction of multi-models is more accurate than an individual one.
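
The last point in the summary above, averaging the predictions of several models, can be sketched as follows. This is a minimal illustration under stated assumptions: the function names `average_predictions` and `predict`, the two-class setup, and the probability values are hypothetical and do not come from the cited work.

```python
def average_predictions(model_probs):
    """Average per-class probabilities across models for one example."""
    n_models = len(model_probs)
    n_classes = len(model_probs[0])
    return [
        sum(probs[c] for probs in model_probs) / n_models
        for c in range(n_classes)
    ]


def predict(model_probs):
    """Return the class index with the highest averaged probability."""
    averaged = average_predictions(model_probs)
    return max(range(len(averaged)), key=lambda c: averaged[c])


# Hypothetical outputs of three models for one candidate sentence:
# [P(not a cited text span), P(cited text span)]
model_probs = [
    [0.6, 0.4],
    [0.3, 0.7],
    [0.2, 0.8],
]

print(average_predictions(model_probs))
print(predict(model_probs))
```

Here the first model alone would reject the candidate, while the averaged prediction accepts it, which is the kind of disagreement that averaging is meant to smooth out.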

Detecting ESG topics using domain-specific language models and data augmentation approaches

This work experiments with further language model pre-training using large amounts of in-domain data from business and financial news and applies augmentation approaches to increase the size of the dataset for model fine-tuning, demonstrating that both approaches are beneficial to accuracy in classification tasks.

SciNLI: A Corpus for Natural Language Inference on Scientific Text

SciNLI is a large dataset for NLI that captures the formality in scientific text and contains 107,412 sentence pairs extracted from scholarly papers on NLP and computational linguistics, well suited to serve as a benchmark for the evaluation of scientific NLU models.

Predictions For Pre-training Language Models

This paper investigates whether it is still helpful to add the specific task's loss in the pre-training step, and uses the fine-tuned model to assign pseudo-labels to user-generated unlabeled data for pre-training.

IITKGP at W-NUT 2020 Shared Task-1: Domain specific BERT representation for Named Entity Recognition of lab protocol

This paper describes a system for named entity tagging based on Bio-Bert and shows that the model gives substantial improvements over the baseline, finishing as fourth runner-up in terms of F1 score and first runner-up in terms of recall, just 2.21 F1 points behind the best system.



Universal Language Model Fine-tuning for Text Classification

This work proposes Universal Language Model Fine-tuning (ULMFiT), an effective transfer learning method that can be applied to any task in NLP, and introduces techniques that are key for fine-tuning a language model.

Publicly Available Clinical BERT Embeddings

This work explores and releases two BERT models for clinical text: one for generic clinical text and another for discharge summaries specifically, and demonstrates that using a domain-specific model yields performance improvements on 3/5 clinical NLP tasks, establishing a new state-of-the-art on the MedNLI dataset.

AllenNLP: A Deep Semantic Natural Language Processing Platform

This paper describes AllenNLP, a library for applying deep learning methods to NLP research that addresses issues with easy-to-use command-line tools, declarative configuration-driven experiments, and modular NLP abstractions.

Improving Language Understanding by Generative Pre-Training

The general task-agnostic model outperforms discriminatively trained models that use architectures specifically crafted for each task, improving upon the state of the art in 9 out of the 12 tasks studied.

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.

Optimal Hyperparameters for Deep LSTM-Networks for Sequence Labeling Tasks

This paper evaluates the importance of different network design choices and hyperparameters for five common linguistic sequence tagging tasks and finds that some parameters, like the pre-trained word embeddings or the last layer of the network, have a large impact on performance, while other parameters, for example the number of LSTM layers or the number of recurrent units, are of minor importance.

End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF

A novel neural network architecture is introduced that benefits from both word- and character-level representations automatically, by using a combination of bidirectional LSTM, CNN and CRF, thus making it applicable to a wide range of sequence labeling tasks.

Deep Biaffine Attention for Neural Dependency Parsing

This paper uses a larger but more thoroughly regularized parser than other recent BiLSTM-based approaches, with biaffine classifiers to predict arcs and labels, and shows which hyperparameter choices had a significant effect on parsing accuracy, allowing it to achieve large gains over other graph-based approaches.

ScispaCy: Fast and Robust Models for Biomedical Natural Language Processing

This paper describes scispaCy, a new Python library and set of models for practical biomedical/scientific text processing that heavily leverages the spaCy library. It details the performance of two packages of models released in scispaCy and demonstrates their robustness on several tasks and datasets.

Multi-Task Identification of Entities, Relations, and Coreference for Scientific Knowledge Graph Construction

The multi-task setup reduces cascading errors between tasks and leverages cross-sentence relations through coreference links and supports construction of a scientific knowledge graph, which is used to analyze information in scientific literature.