Don’t Stop Pretraining: Adapt Language Models to Domains and Tasks

Suchin Gururangan, Ana Marasović, Swabha Swayamdipta, Kyle Lo, Iz Beltagy, Doug Downey, Noah A. Smith
Language models pretrained on text from a wide variety of sources form the foundation of today’s NLP. In light of the success of these broad-coverage models, we investigate whether it is still helpful to tailor a pretrained model to the domain of a target task. We present a study across four domains (biomedical and computer science publications, news, and reviews) and eight classification tasks, showing that a second phase of pretraining in-domain (domain-adaptive pretraining) leads to… 

Efficient Domain Adaptation of Language Models via Adaptive Tokenization

This work proposes an alternative approach for transferring pretrained language models to new domains by adapting their tokenizers, and shows that domain-specific subword sequences can be determined efficiently, directly from divergences in the conditional token distributions of the base and domain-specific corpora.

Back-Translated Task Adaptive Pretraining: Improving Accuracy and Robustness on Text Classification

This work proposes a back-translated task-adaptive pretraining (BT-TAPT) method that increases the amount of task-specific data for LM re-pretraining by augmenting the task data with back-translation, generalizing the LM to the target task domain.

Task-adaptive Pre-training of Language Models with Word Embedding Regularization

A novel fine-tuning process is proposed: task-adaptive pre-training with word embedding regularization (TAPTER), which improves on both standard fine-tuning and task-adaptive pre-training on BioASQ and SQuAD when the pre-training corpora are not dominated by in-domain data.

Adapt-and-Distill: Developing Small, Fast and Effective Pretrained Language Models for Domains

This paper proposes domain-specific vocabulary expansion in the adaptation stage, employs corpus-level occurrence probability to choose the size of the incremental vocabulary automatically, and systematically explores strategies for compressing large pretrained models for specific domains.

Towards a Robust Question Answering System through Domain-adaptive Pretraining and Data Augmentation

This paper proposes to continue pretraining LMs on the target domains, finds that domain-adaptive pretraining helps improve out-of-domain test performance, and uses data augmentation to maximally utilize the available data for domain adaptation.

An Empirical Investigation towards Efficient Multi-Domain Language Model Pre-training

An empirical investigation of known methods for mitigating catastrophic forgetting is conducted, finding that elastic weight consolidation provides the best overall scores, yielding only a 0.33% drop in performance across seven generic tasks while remaining competitive on biomedical tasks.

On the Domain Adaptation and Generalization of Pretrained Language Models: A Survey

A taxonomy of domain adaptation approaches is proposed from a machine-learning-system view, covering methods for input augmentation, model optimization, and personalization; the survey discusses and compares these methods and suggests promising future research directions.

Revisiting Pretraining with Adapters

This work explores alternatives to full-scale task-specific pretraining of language models through adapter modules, a parameter-efficient approach to transfer learning, and finds that adapter-based pretraining achieves results comparable to task-specific pretraining while using a fraction of the overall trainable parameters.

Taming Pre-trained Language Models with N-gram Representations for Low-Resource Domain Adaptation

A Transformer-based Domain-aware N-gram Adaptor, T-DNA, is introduced to effectively learn and incorporate the semantic representation of different combinations of words in the new domain through the adaptation of (word-based) n-grams.

Towards Simple and Efficient Task-Adaptive Pre-training for Text Classification

It is shown that training only the BERT embedding layer during TAPT is sufficient to adapt to the vocabulary of the target domain and achieve comparable performance.

Unsupervised Domain Adaptation of Contextualized Embeddings for Sequence Labeling

Domain-adaptive fine-tuning offers a simple and effective approach to the unsupervised adaptation of sequence labeling to difficult new domains, and is tested in two challenging domains: Early Modern English and Twitter.

SciBERT: A Pretrained Language Model for Scientific Text

SciBERT leverages unsupervised pretraining on a large multi-domain corpus of scientific publications to improve performance on downstream scientific NLP tasks and demonstrates statistically significant improvements over BERT.

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

This systematic study compares pre-training objectives, architectures, unlabeled datasets, transfer approaches, and other factors on dozens of language understanding tasks and achieves state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more.

Improving Language Understanding by Generative Pre-Training

The general task-agnostic model outperforms discriminatively trained models that use architectures specifically crafted for each task, improving upon the state of the art in 9 out of the 12 tasks studied.

Unsupervised Domain Clusters in Pretrained Language Models

It is shown that massive pre-trained language models implicitly learn sentence representations that cluster by domains without supervision – suggesting a simple data-driven definition of domains in textual data and proposing domain data selection methods based on such models, which require only a small set of in-domain monolingual data.

To Tune or Not to Tune? Adapting Pretrained Representations to Diverse Tasks

The empirical results across diverse NLP tasks with two state-of-the-art models show that the relative performance of fine-tuning vs. feature extraction depends on the similarity of the pretraining and target tasks.

What to do about non-standard (or non-canonical) language in NLP

The notion of canonicity is reviewed, along with how it shapes the community's approach to language; the paper argues for adaptive language technology capable of addressing natural language variation.

Sentence Encoders on STILTs: Supplementary Training on Intermediate Labeled-data Tasks

Supplementary training on data-rich supervised tasks, such as natural language inference, yields additional performance improvements on the GLUE benchmark, as well as reduced variance across random restarts.

Publicly Available Clinical BERT Embeddings

This work explores and releases two BERT models for clinical text: one for generic clinical text and another for discharge summaries specifically, and demonstrates that using a domain-specific model yields performance improvements on 3/5 clinical NLP tasks, establishing a new state-of-the-art on the MedNLI dataset.

Pretrained Language Models for Sequential Sentence Classification

This work constructs a joint sentence representation that allows BERT Transformer layers to directly utilize contextual information from all words in all sentences, and achieves state-of-the-art results on four datasets, including a new dataset of structured scientific abstracts.