Efficient Hierarchical Domain Adaptation for Pretrained Language Models

Alexandra Chronopoulou, Matthew E. Peters, Jesse Dodge
The remarkable success of large language models has been driven by dense models trained on massive unlabeled, unstructured corpora. These corpora typically contain text from diverse, heterogeneous sources, but information about the source of the text is rarely used during training. Transferring a model's knowledge to a target domain is typically done by continued in-domain training. In this paper, we introduce a method to permit domain adaptation to many diverse domains using a computationally…
Mix and Match: Learning-free Controllable Text Generation using Energy Language Models
This work proposes Mix and Match LM, a global score-based alternative for controllable text generation that combines arbitrary pre-trained black-box models for achieving the desired attributes in the generated text without involving any fine-tuning or structural assumptions about the black-box models.
Geographic Adaptation of Pretrained Language Models
This work introduces an approach to task-agnostic geoadaptation of PLMs that forces the PLM to learn associations between linguistic phenomena and geographic locations; it achieves state-of-the-art performance in supervised geolocation prediction and reports massive gains over geographically uninformed PLMs on zero-shot geolocation prediction.
Time Waits for No One! Analysis and Challenges of Temporal Misalignment
It is found that, while temporal adaptation through continued pretraining can help, these gains are small compared to task-specific finetuning on data from the target time period, which motivates continued research to improve temporal robustness of NLP models.
DEMix Layers: Disentangling Domains for Modular Language Modeling
A new domain expert mixture (DEMix) layer that enables conditioning a language model (LM) on the domain of the input text, and shows it is possible to add experts to adapt to new domains without forgetting older ones, and remove experts to restrict access to unwanted domains.
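The core idea of a DEMix layer can be sketched as a per-domain expert feedforward network selected by the input's domain label. The following is a minimal illustrative sketch, not the paper's implementation; the domain names, dimensions, and the `demix_ffn` helper are hypothetical:

```python
import numpy as np

def demix_ffn(h, experts, domain_id):
    """Illustrative DEMix-style layer: one feedforward expert per domain.

    `experts` maps a (hypothetical) domain id to a (W1, W2) weight pair;
    the input's domain id selects which expert processes the hidden
    states. Adding a domain only means adding a new dict entry, so
    existing experts (and what they have learned) are left untouched.
    """
    W1, W2 = experts[domain_id]
    return np.maximum(h @ W1, 0.0) @ W2  # expert FFN with ReLU

rng = np.random.default_rng(0)
make_expert = lambda: (rng.normal(size=(8, 32)), rng.normal(size=(32, 8)))
experts = {"news": make_expert(), "reviews": make_expert()}
experts["medical"] = make_expert()  # new domain added without retraining others

h = rng.normal(size=(4, 8))
out = demix_ffn(h, experts, "medical")
print(out.shape)  # (4, 8)
```

Because routing is conditioned on a discrete domain label rather than learned gating, experts can also be removed to restrict access to unwanted domains, as the summary above notes.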
Don’t Stop Pretraining: Adapt Language Models to Domains and Tasks
It is consistently found that multi-phase adaptive pretraining offers large gains in task performance, and it is shown that adapting to a task corpus augmented using simple data selection strategies is an effective alternative, especially when resources for domain-adaptive pretraining might be unavailable.
MAD-X: An Adapter-based Framework for Multi-task Cross-lingual Transfer
MAD-X is proposed, an adapter-based framework that enables high portability and parameter-efficient transfer to arbitrary tasks and languages by learning modular language and task representations and introduces a novel invertible adapter architecture and a strong baseline method for adapting a pretrained multilingual model to a new language.
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
This systematic study compares pre-training objectives, architectures, unlabeled datasets, transfer approaches, and other factors on dozens of language understanding tasks and achieves state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more.
Language Models are Unsupervised Multitask Learners
It is demonstrated that language models begin to learn these tasks without any explicit supervision when trained on a new dataset of millions of webpages called WebText, suggesting a promising path towards building language processing systems which learn to perform tasks from their naturally occurring demonstrations.
Parameter-Efficient Transfer Learning for NLP
To demonstrate the adapters' effectiveness, the recently proposed BERT Transformer model is transferred to 26 diverse text classification tasks, including the GLUE benchmark, and adapters attain near state-of-the-art performance whilst adding only a few parameters per task.
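The adapter modules described above are small bottleneck networks inserted into a frozen pretrained model. A minimal sketch of one such module, assuming a Houlsby-style down-project/nonlinearity/up-project shape with a residual connection (all dimensions here are hypothetical):

```python
import numpy as np

class Adapter:
    """Bottleneck adapter module (illustrative sketch).

    Down-projects a hidden state, applies a ReLU, up-projects back,
    and adds a residual connection. In adapter-based transfer, only
    these two small matrices are trained per task; the pretrained
    model's weights stay frozen.
    """
    def __init__(self, hidden_dim=16, bottleneck_dim=4, seed=0):
        rng = np.random.default_rng(seed)
        # Near-zero init keeps the module close to identity at the start.
        self.down = rng.normal(scale=0.02, size=(hidden_dim, bottleneck_dim))
        self.up = rng.normal(scale=0.02, size=(bottleneck_dim, hidden_dim))

    def __call__(self, h):
        z = np.maximum(h @ self.down, 0.0)  # down-project + ReLU
        return h + z @ self.up              # up-project + residual

adapter = Adapter()
h = np.ones((2, 16))          # a toy batch of hidden states
out = adapter(h)
print(out.shape)  # (2, 16)
```

The near-identity initialization is why training stays stable: at the start the adapter barely perturbs the pretrained representations, and task-specific behavior is learned gradually in the small bottleneck.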
Learning multiple visual domains with residual adapters
This paper develops a tunable deep network architecture that, by means of adapter residual modules, can be steered on the fly to diverse visual domains, and introduces the Visual Decathlon Challenge, a benchmark that evaluates the ability of representations to capture ten very different visual domains simultaneously and measures how uniformly well they recognize each of them.
Efficient Large Scale Language Modeling with Mixtures of Experts
This paper presents a detailed empirical study of how autoregressive MoE language models scale in comparison with dense models in a wide range of settings: in and out-of-domain language modeling, zero and few-shot priming, and full finetuning.
Multilingual Unsupervised Neural Machine Translation with Denoising Adapters
This paper proposes to use denoising adapters, adapter layers with a denoising objective, on top of pre-trained mBART-50; it shows that the resulting translations are on par with back-translation as measured by BLEU, and that the approach furthermore allows adding unseen languages incrementally.
Towards a Unified View of Parameter-Efficient Transfer Learning
This paper re-frames state-of-the-art parameter-efficient transfer learning methods as modifications to specific hidden states in pretrained models, and defines a set of design dimensions along which different methods vary, achieving comparable results to fine-tuning all parameters on all four tasks.