Don’t Stop Pretraining: Adapt Language Models to Domains and Tasks

@article{Gururangan2020DontSP,
  title={Don’t Stop Pretraining: Adapt Language Models to Domains and Tasks},
  author={Suchin Gururangan and Ana Marasovi{\'c} and Swabha Swayamdipta and Kyle Lo and Iz Beltagy and Doug Downey and Noah A. Smith},
  journal={ArXiv},
  year={2020},
  volume={abs/2004.10964}
}
Language models pretrained on text from a wide variety of sources form the foundation of today’s NLP. In light of the success of these broad-coverage models, we investigate whether it is still helpful to tailor a pretrained model to the domain of a target task. We present a study across four domains (biomedical and computer science publications, news, and reviews) and eight classification tasks, showing that a second phase of pretraining in-domain (domain-adaptive pretraining) leads to… 
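
As a rough illustration of domain-adaptive pretraining (DAPT), the Python sketch below continues masked-language-model pretraining of RoBERTa on an unlabeled in-domain corpus with Hugging Face Transformers. The corpus file name, output path, and hyperparameters are placeholders rather than the paper's configuration.

# Minimal DAPT sketch: continue masked-LM pretraining of RoBERTa on in-domain text.
# Assumes `domain_corpus.txt` holds one in-domain passage per line (placeholder path).
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForMaskedLM.from_pretrained("roberta-base")

raw = load_dataset("text", data_files={"train": "domain_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

train_ds = raw["train"].map(tokenize, batched=True, remove_columns=["text"])

# Dynamic masking, RoBERTa-style MLM objective.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="dapt-roberta",      # placeholder output path
    per_device_train_batch_size=8,
    num_train_epochs=1,             # illustrative; real DAPT runs train much longer
    learning_rate=5e-5,
    save_strategy="epoch",
)

Trainer(model=model, args=args, train_dataset=train_ds, data_collator=collator).train()
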
Efficient Domain Adaptation of Language Models via Adaptive Tokenization
TLDR
This work proposes an alternative approach for transferring pretrained language models to new domains by adapting their tokenizers and shows that domain-specific subword sequences can be determined efficiently directly from divergences in the conditional token distributions of the base and domain-specific corpora.
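
As a rough sketch of the tokenizer-adaptation idea, the code below ranks whole words by a smoothed log-frequency ratio between a domain corpus and a base corpus and adds the most over-represented multi-piece words to the tokenizer. This is a simplification: the paper scores subword sequences using divergences of conditional token distributions, and the corpus paths, checkpoint, and cutoff here are placeholders.

# Simplified tokenizer adaptation: add domain-salient words as new tokens and
# resize the embedding matrix. Frequency-ratio heuristic, not the paper's exact scoring.
import math
from collections import Counter

from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForMaskedLM.from_pretrained("roberta-base")

def word_counts(path):
    counts = Counter()
    with open(path) as f:
        for line in f:
            counts.update(line.strip().split())
    return counts

base = word_counts("base_corpus.txt")      # placeholder path
domain = word_counts("domain_corpus.txt")  # placeholder path

base_total = sum(base.values()) + len(base)
domain_total = sum(domain.values())

def log_ratio(word):
    p_domain = domain[word] / domain_total
    p_base = (base.get(word, 0) + 1) / base_total   # add-one smoothing
    return math.log(p_domain / p_base)

# Words strongly over-represented in the domain corpus that the current tokenizer
# splits into several pieces are candidates for new tokens.
ranked = sorted(domain, key=log_ratio, reverse=True)
candidates = [w for w in ranked if len(tokenizer.tokenize(w)) > 1][:1000]

num_added = tokenizer.add_tokens(candidates)
model.resize_token_embeddings(len(tokenizer))
print(f"added {num_added} domain-specific tokens")
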
Back-Translated Task Adaptive Pretraining: Improving Accuracy and Robustness on Text Classification
TLDR
This work proposes a back-translated task-adaptive pretraining (BT-TAPT) method that increases the amount of task-specific data for LM re-pretraining by augmenting the task data using back-translation, generalizing the LM to the target task domain.
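
A minimal back-translation augmentation sketch, assuming MarianMT English-German checkpoints as the translation models (an illustrative choice, not necessarily the models used for BT-TAPT); the paraphrases it produces would be appended to the task corpus before task-adaptive pretraining.

# Back-translation augmentation: EN -> DE -> EN paraphrases of task sentences.
from transformers import MarianMTModel, MarianTokenizer

def load(name):
    return MarianTokenizer.from_pretrained(name), MarianMTModel.from_pretrained(name)

fwd_tok, fwd_model = load("Helsinki-NLP/opus-mt-en-de")
bwd_tok, bwd_model = load("Helsinki-NLP/opus-mt-de-en")

def translate(sentences, tok, model):
    batch = tok(sentences, return_tensors="pt", padding=True, truncation=True)
    generated = model.generate(**batch, max_length=256)
    return tok.batch_decode(generated, skip_special_tokens=True)

def back_translate(sentences):
    return translate(translate(sentences, fwd_tok, fwd_model), bwd_tok, bwd_model)

task_sentences = ["The reaction was mild and resolved without treatment."]  # placeholder
augmented = back_translate(task_sentences)
print(augmented)  # paraphrases to append to the TAPT corpus
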
Task-adaptive Pre-training of Language Models with Word Embedding Regularization
TLDR
A novel fine-tuning process, task-adaptive pre-training with word embedding regularization (TAPTER), is proposed; it improves on standard fine-tuning and on task-adaptive pre-training on BioASQ and SQuAD when their pre-training corpora were not dominated by in-domain data.
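
A minimal sketch of the embedding-regularization idea: the MLM loss is combined with an L2 penalty that keeps the input embedding matrix close to a fixed reference. Using a frozen copy of the pretrained embeddings as the reference and the value of reg_lambda are simplifying assumptions; TAPTER specifically regularizes toward static embeddings fitted on the downstream corpus.

# TAPT with word-embedding regularization (simplified): MLM loss plus an L2 penalty
# tying the input embeddings to a frozen reference copy.
import torch
from transformers import AutoModelForMaskedLM

model = AutoModelForMaskedLM.from_pretrained("roberta-base")
reference = model.get_input_embeddings().weight.detach().clone()  # frozen reference
reg_lambda = 0.01  # illustrative value

def regularized_loss(batch):
    # `batch` must contain input_ids, attention_mask, and labels for the MLM loss.
    outputs = model(**batch)
    embeddings = model.get_input_embeddings().weight
    reg = ((embeddings - reference) ** 2).sum()
    return outputs.loss + reg_lambda * reg

# Inside the training loop: loss = regularized_loss(batch); loss.backward(); ...
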
Adapt-and-Distill: Developing Small, Fast and Effective Pretrained Language Models for Domains
TLDR
This paper proposes domain-specific vocabulary expansion in the adaptation stage, employs corpus-level occurrence probability to choose the size of the incremental vocabulary automatically, and systematically explores different strategies to compress large pretrained models for specific domains.
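
For the compression half, a standard soft-label knowledge-distillation loss looks like the sketch below; the temperature and mixing weight are illustrative, and the paper explores several distillation strategies beyond this basic form.

# Soft-label knowledge distillation: KL between temperature-scaled teacher and
# student distributions, mixed with the hard-label cross-entropy.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, temperature=2.0, alpha=0.5):
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

# Example with random tensors standing in for model outputs.
student_logits = torch.randn(4, 3)
teacher_logits = torch.randn(4, 3)
labels = torch.tensor([0, 2, 1, 0])
print(distillation_loss(student_logits, teacher_logits, labels))
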
Towards a Robust Question Answering System through Domain-adaptive Pretraining and Data Augmentation
TLDR
This paper proposes to continue pretraining the LMs on the target domains, finds that domain-adaptive pretraining helps improve out-of-domain test performance, and proposes data augmentation tricks to make maximal use of these data for domain adaptation.
An Empirical Investigation towards Efficient Multi-Domain Language Model Pre-training
TLDR
An empirical investigation into known methods to mitigate catastrophic forgetting is conducted, and it is found that elastic weight consolidation provides the best overall scores, yielding only a 0.33% drop in performance across seven generic tasks while remaining competitive on biomedical tasks.
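
For reference, elastic weight consolidation adds a quadratic penalty that anchors parameters important to earlier tasks (as measured by a diagonal Fisher information estimate) to their previous values. The sketch below shows only the penalty term; how the Fisher estimate is obtained and the value of ewc_lambda are left as assumptions.

# EWC penalty: Fisher-weighted quadratic anchoring of parameters to their values
# after the previous task. `fisher` and `old_params` are dicts of tensors keyed by
# parameter name, computed on the previous task (assumed available here).
import torch

def ewc_penalty(model, fisher, old_params, ewc_lambda=1000.0):
    penalty = torch.zeros((), device=next(model.parameters()).device)
    for name, param in model.named_parameters():
        if name in fisher:
            penalty = penalty + (fisher[name] * (param - old_params[name]) ** 2).sum()
    return 0.5 * ewc_lambda * penalty

# During training on the new domain:
#   loss = task_loss + ewc_penalty(model, fisher, old_params)
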
Revisiting Pretraining with Adapters
TLDR
This work explores alternatives to full-scale task-specific pretraining of language models through the use of adapter modules, a parameter-efficient approach to transfer learning, and finds that adapter-based pretraining is able to achieve results comparable to task-specific pretraining while using a fraction of the overall trainable parameters.
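
The adapter idea in a few lines: a small bottleneck module with a residual connection is inserted into each transformer layer, and only these modules (plus, depending on the recipe, layer norms and the classifier head) are trained while the pretrained weights stay frozen. The sizes below are illustrative.

# Minimal adapter module (bottleneck with residual connection). Only modules like
# this are trained; the pretrained transformer weights stay frozen.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, hidden_size=768, bottleneck_size=64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck_size)
        self.up = nn.Linear(bottleneck_size, hidden_size)
        self.act = nn.GELU()

    def forward(self, hidden_states):
        # The residual path preserves the frozen model's behaviour when the adapter
        # output is near zero.
        return hidden_states + self.up(self.act(self.down(hidden_states)))

adapter = Adapter()
x = torch.randn(2, 16, 768)   # (batch, sequence, hidden)
print(adapter(x).shape)       # torch.Size([2, 16, 768])
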
Taming Pre-trained Language Models with N-gram Representations for Low-Resource Domain Adaptation
TLDR
A Transformer-based Domain-aware N-gram Adaptor, T-DNA, is introduced to effectively learn and incorporate the semantic representation of different combinations of words in the new domain through the adaptation of (word-based) n-grams.
CLIN-X: pre-trained language models and a study on cross-task transfer for concept extraction in the clinical domain
TLDR
The results highlight the importance of specialized language models, such as CLIN-X, for concept extraction in non-standard domains, but also show that the task-agnostic model architecture is robust across the tested tasks and languages so that domain- or task-specific adaptations are not required.
MDAPT: Multilingual Domain Adaptive Pretraining in a Single Model
TLDR
Evaluation on nine domain-specific datasets shows that a single multilingual domain-specific model can outperform the general multilingual model and performs close to its monolingual counterpart.
...

References

Showing 1-10 of 75 references
Unsupervised Domain Adaptation of Contextualized Embeddings for Sequence Labeling
TLDR
Domain-adaptive fine-tuning offers a simple and effective approach for the unsupervised adaptation of sequence labeling to difficult new domains and is tested on sequence labeling in two challenging domains: Early Modern English and Twitter.
SciBERT: A Pretrained Language Model for Scientific Text
TLDR
SciBERT leverages unsupervised pretraining on a large multi-domain corpus of scientific publications to improve performance on downstream scientific NLP tasks and demonstrates statistically significant improvements over BERT.
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
TLDR
This systematic study compares pre-training objectives, architectures, unlabeled datasets, transfer approaches, and other factors on dozens of language understanding tasks and achieves state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more.
Improving Language Understanding by Generative Pre-Training
TLDR
The general task-agnostic model outperforms discriminatively trained models that use architectures specifically crafted for each task, improving upon the state of the art in 9 out of the 12 tasks studied.
GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding
TLDR
A benchmark of nine diverse NLU tasks, an auxiliary dataset for probing models for understanding of specific linguistic phenomena, and an online platform for evaluating and comparing models, which favors models that can represent linguistic knowledge in a way that facilitates sample-efficient learning and effective knowledge-transfer across tasks.
Unsupervised Domain Clusters in Pretrained Language Models
TLDR
It is shown that massive pre-trained language models implicitly learn sentence representations that cluster by domain without supervision, suggesting a simple data-driven definition of domains in textual data; domain data selection methods based on such models are then proposed, requiring only a small set of in-domain monolingual data.
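
A rough sketch of the data-selection idea: embed sentences with a pretrained encoder via mean pooling, cluster the embeddings without supervision, and use cluster membership to pick in-domain data. KMeans is used here for simplicity, whereas the paper works with Gaussian mixtures, and the checkpoint and example sentences are placeholders.

# Unsupervised domain clustering of mean-pooled encoder representations.
import torch
from sklearn.cluster import KMeans
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
encoder = AutoModel.from_pretrained("roberta-base")
encoder.eval()

sentences = [
    "The patient was given 40 mg of the drug twice daily.",   # placeholder examples
    "The court ruled that the contract was void.",
    "Dosage was reduced after adverse events were reported.",
    "The defendant appealed the decision last week.",
]

with torch.no_grad():
    batch = tokenizer(sentences, return_tensors="pt", padding=True, truncation=True)
    hidden = encoder(**batch).last_hidden_state            # (batch, seq, hidden)
    mask = batch["attention_mask"].unsqueeze(-1)
    embeddings = (hidden * mask).sum(1) / mask.sum(1)       # mean pooling over real tokens

labels = KMeans(n_clusters=2, n_init=10).fit_predict(embeddings.numpy())
print(labels)  # cluster ids usable for in-domain data selection
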
To Tune or Not to Tune? Adapting Pretrained Representations to Diverse Tasks
TLDR
The empirical results across diverse NLP tasks with two state-of-the-art models show that the relative performance of fine-tuning vs. feature extraction depends on the similarity of the pretraining and target tasks.
What to do about non-standard (or non-canonical) language in NLP
TLDR
The notion of canonicity is reviewed, along with how it shapes the community's approach to language, and it is argued that embracing non-canonical language will also enable adaptive language technology capable of addressing natural language variation.
Sentence Encoders on STILTs: Supplementary Training on Intermediate Labeled-data Tasks
TLDR
Supplementary training on data-rich supervised tasks, such as natural language inference, yields additional performance improvements on the GLUE benchmark, as well as reduced variance across random restarts in this setting.
An Embarrassingly Simple Approach for Transfer Learning from Pretrained Language Models
TLDR
This paper combines the task-specific objective with an auxiliary language model objective that is adjusted during training, preserving the language regularities captured by the pretrained model while enabling sufficient adaptation for solving the target task.
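
A minimal sketch of such a combined objective, assuming a simple linear decay of the auxiliary weight (the paper adjusts the weight with its own scheme):

# Task loss plus an auxiliary LM loss whose weight decays over training.
import torch

def combined_loss(task_loss, lm_loss, step, total_steps, gamma0=0.5):
    # Anneal the auxiliary LM weight from gamma0 toward zero (illustrative schedule).
    gamma = gamma0 * (1.0 - step / total_steps)
    return task_loss + gamma * lm_loss

# Example with scalar stand-ins for the two losses.
print(combined_loss(torch.tensor(0.7), torch.tensor(2.3), step=100, total_steps=1000))
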
...