DEMix Layers: Disentangling Domains for Modular Language Modeling
@article{Gururangan2021DEMixLD,
  title   = {DEMix Layers: Disentangling Domains for Modular Language Modeling},
  author  = {Suchin Gururangan and Michael Lewis and Ari Holtzman and Noah A. Smith and Luke Zettlemoyer},
  journal = {ArXiv},
  year    = {2021},
  volume  = {abs/2108.05036}
}
We introduce a new domain expert mixture (DEMix) layer that enables conditioning a language model (LM) on the domain of the input text. A DEMix layer includes a collection of expert feedforward networks, each specialized to a domain, that makes the LM modular: experts can be mixed, added, or removed after initial training. Extensive experiments with autoregressive transformer LMs (up to 1.3B parameters) show that DEMix layers reduce test-time perplexity (especially for out-of-domain data…
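The abstract describes the layer at a high level; below is a minimal sketch of the idea, assuming a simple setup in which each expert is a two-layer feedforward block and the domain id of a batch is known at training time. Class and argument names (DEMixFeedForward, d_hidden, domain_id) are illustrative, not the authors' released implementation.

```python
# Minimal sketch of a DEMix-style layer: one feedforward "expert" per domain,
# selected by a known domain id. Not the authors' code; names are illustrative.
import torch
import torch.nn as nn


class DEMixFeedForward(nn.Module):
    def __init__(self, d_model: int, d_hidden: int, num_domains: int):
        super().__init__()
        # One expert FFN per domain; experts can later be added or removed by
        # editing this ModuleList, which is what makes the layer modular.
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, d_hidden),
                nn.GELU(),
                nn.Linear(d_hidden, d_model),
            )
            for _ in range(num_domains)
        )

    def forward(self, x: torch.Tensor, domain_id: int) -> torch.Tensor:
        # Training-time routing: the batch is assumed to come from one known
        # domain, so only that expert's parameters receive gradients.
        return self.experts[domain_id](x)


# Usage: a batch of hidden states attributed to domain 2.
layer = DEMixFeedForward(d_model=512, d_hidden=2048, num_domains=8)
h = torch.randn(4, 128, 512)          # (batch, sequence, d_model)
out = layer(h, domain_id=2)           # same shape as the input
```

Because the experts live in a plain module list, adding or removing a domain after training amounts to editing that list, which is the modularity the abstract refers to; mixing experts at test time (as the paper does) would combine several experts' outputs instead of indexing a single one.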
12 Citations
KALA: Knowledge-Augmented Language Model Adaptation
- Computer Science, ArXiv
- 2022
A novel domain adaptation framework for PLMs, coined Knowledge-Augmented Language model Adaptation (KALA), is proposed; it modulates the intermediate hidden representations of PLMs with domain knowledge consisting of entities and their relational facts.
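As a rough illustration of what "modulating intermediate hidden representations with domain knowledge" can look like, here is a generic scale-and-shift modulation conditioned on a knowledge vector; this is not KALA's exact formulation, and all names (KnowledgeModulation, d_knowledge) are assumptions.

```python
# Generic sketch: condition a scale and shift of hidden states on an external
# knowledge embedding (e.g. pooled entity representations). Illustrative only.
import torch
import torch.nn as nn


class KnowledgeModulation(nn.Module):
    def __init__(self, d_model: int, d_knowledge: int):
        super().__init__()
        self.to_scale = nn.Linear(d_knowledge, d_model)
        self.to_shift = nn.Linear(d_knowledge, d_model)

    def forward(self, hidden: torch.Tensor, knowledge: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq, d_model); knowledge: (batch, d_knowledge)
        scale = 1.0 + self.to_scale(knowledge).unsqueeze(1)
        shift = self.to_shift(knowledge).unsqueeze(1)
        return scale * hidden + shift
```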
ST-MoE: Designing Stable and Transferable Sparse Expert Models
- Computer Science
- 2022
This work concludes by scaling a sparse model to 269B parameters, with a computational cost comparable to a 32B dense encoder-decoder Transformer (the Stable and Transferable Mixture-of-Experts, or ST-MoE-32B), which achieves state-of-the-art performance in transfer learning.
Lifelong Pretraining: Continually Adapting Language Models to Emerging Corpora
- Computer Science, BIGSCIENCE
- 2022
The authors' experiments show that distillation-based approaches are most effective at retaining downstream performance on earlier domains, improve knowledge transfer so that models achieve better downstream performance on the latest data, and improve temporal generalization when distribution gaps exist between training and evaluation because of time.
Domain Generalisation of NMT: Fusing Adapters with Leave-One-Domain-Out Training
- Computer Science, FINDINGS
- 2022
To address the challenge of not knowing the test domain at training time, this paper proposes a fusion-based generalisation method that learns to combine domain-specific parameters, together with a leave-one-domain-out training strategy that avoids information leakage.
Designing Effective Sparse Expert Models
- Computer Science, ArXiv
- 2022
This work concludes by scaling a sparse model to 269B parameters, with a computational cost comparable to a 32B dense encoder-decoder Transformer (the Stable and Transferable Mixture-of-Experts, or ST-MoE-32B), which achieves state-of-the-art performance in transfer learning.
ELLE: Efficient Lifelong Pre-training for Emerging Data
- Computer Science, FINDINGS
- 2022
The proposed ELLE consists of function preserved model expansion, which flexibly expands an existing PLM’s width and depth to improve the efficiency of knowledge acquisition, and pre-trained domain prompts, which disentangle the versatile knowledge learned during pre-training and stimulate the proper knowledge for downstream tasks.
Time Waits for No One! Analysis and Challenges of Temporal Misalignment
- Computer Science, ArXiv
- 2021
This work establishes a suite of tasks across multiple domains to study temporal misalignment in modern NLP systems and concludes that, while temporal adaptation through continued pretraining can help, these gains are small compared to task-specific finetuning on data from the target time period.
Unified Modeling of Multi-Domain Multi-Device ASR Systems
- Computer Science, ArXiv
- 2022
Experiments show that the proposed unified modeling approach outperforms the carefully tuned per-domain models, giving relative gains of up to 10% over a baseline model with a negligible increase in the number of parameters.
Adapting to the Long Tail: A Meta-Analysis of Transfer Learning Research for Language Understanding Tasks
- Computer Science, ArXiv
- 2021
This work reflects on the question of whether transfer learning methods have sufficiently addressed the performance of benchmark-trained models on the long tail, and assesses trends in transfer learning research through a qualitative meta-analysis of 100 representative papers on transfer learning for NLU.
Vocal markers of autism: assessing the generalizability of machine learning models
- Psychology, bioRxiv
- 2021
This paper systematically assesses the generalizability of ML models of vocal markers, and more generally biobehavioral markers, of autism across a variety of contexts, finding that they generalize poorly to different, though similar, tasks and not at all to new languages.
References
Showing 1-10 of 75 references
Plug and Play Language Models: A Simple Approach to Controlled Text Generation
- Computer Science, ICLR
- 2020
The Plug and Play Language Model (PPLM) for controllable language generation is proposed, which combines a pretrained LM with one or more simple attribute classifiers that guide text generation without any further training of the LM.
Multidomain Pretrained Language Models for Green NLP
- Computer Science, ADAPTNLP
- 2021
This paper shows that domain adaptation can be generalised to cover multiple domains: a single model can be trained across various domains at the same time with minimal drop in performance, even when less data and fewer resources are used.
LAMOL: LAnguage MOdeling for Lifelong Language Learning
- Computer Science, ICLR
- 2020
The results show that LAMOL prevents catastrophic forgetting without any sign of intransigence and can perform five very different language tasks sequentially with only one model.
Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer
- Computer Science, ICLR
- 2017
This work introduces a Sparsely-Gated Mixture-of-Experts layer (MoE), consisting of up to thousands of feed-forward sub-networks, and applies the MoE to the tasks of language modeling and machine translation, where model capacity is critical for absorbing the vast quantities of knowledge available in the training corpora.
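A simplified sketch of the sparsely gated idea summarized above: a learned gate scores experts per token, only the top-k experts are run, and their outputs are combined with renormalized gate weights. The noise term and load-balancing losses of the original paper are omitted, and the names (SparseMoE, k) are illustrative.

```python
# Simplified sparsely gated mixture-of-experts layer: route each token to its
# top-k experts and mix their outputs with renormalized gate weights.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseMoE(nn.Module):
    def __init__(self, d_model: int, d_hidden: int, num_experts: int, k: int = 2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(), nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model); gate scores decide which experts run.
        scores = self.gate(x)                                 # (tokens, experts)
        topk_scores, topk_idx = scores.topk(self.k, dim=-1)   # (tokens, k)
        weights = F.softmax(topk_scores, dim=-1)              # renormalize over chosen k
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = topk_idx[:, slot] == e                 # tokens sent to expert e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out


moe = SparseMoE(d_model=64, d_hidden=256, num_experts=4, k=2)
print(moe(torch.randn(10, 64)).shape)   # torch.Size([10, 64])
```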
DExperts: Decoding-Time Controlled Text Generation with Experts and Anti-Experts
- Computer Science, ACL
- 2021
This work applies DExperts to language detoxification and sentiment-controlled generation, where it outperforms existing controllable generation methods on both automatic and human evaluations, highlighting the promise of tuning small LMs on text with (un)desirable attributes for efficient decoding-time steering.
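A small sketch of the decoding-time steering described above, assuming three LMs that share a vocabulary: the base LM's next-token logits are shifted by the difference between an expert and an anti-expert, scaled by a steering weight alpha (the value here is a placeholder).

```python
# Sketch of DExperts-style decoding-time steering over next-token logits.
import torch
import torch.nn.functional as F


def dexperts_next_token_probs(base_logits: torch.Tensor,
                              expert_logits: torch.Tensor,
                              antiexpert_logits: torch.Tensor,
                              alpha: float = 2.0) -> torch.Tensor:
    """All inputs are (batch, vocab) next-token logits from three LMs that
    share a vocabulary; returns steered next-token probabilities."""
    steered = base_logits + alpha * (expert_logits - antiexpert_logits)
    return F.softmax(steered, dim=-1)


# Toy usage with random logits standing in for real model outputs.
vocab = 100
probs = dexperts_next_token_probs(torch.randn(1, vocab),
                                  torch.randn(1, vocab),
                                  torch.randn(1, vocab))
print(probs.sum())   # ~1.0
```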
CTRL: A Conditional Transformer Language Model for Controllable Generation
- Computer Science, ArXiv
- 2019
CTRL is released, a 1.63 billion-parameter conditional transformer language model, trained to condition on control codes that govern style, content, and task-specific behavior, providing more explicit control over text generation.
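The conditioning mechanism summarized above amounts to prepending a control code to the prompt; a toy illustration (the code strings and function name here are placeholders, not CTRL's tokenizer interface):

```python
# Toy illustration of control-code conditioning: the code is prepended to the
# prompt so the LM's continuation is steered toward that style or domain.
def build_prompt(control_code: str, prompt: str) -> str:
    return f"{control_code} {prompt}"


print(build_prompt("Reviews", "The camera arrived quickly and"))
```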
Parameter-Efficient Transfer Learning for NLP
- Computer Science, ICML
- 2019
To demonstrate the adapters' effectiveness, the recently proposed BERT Transformer model is transferred to 26 diverse text classification tasks, including the GLUE benchmark, where adapters attain near state-of-the-art performance while adding only a few parameters per task.
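A minimal sketch of a bottleneck adapter of the kind summarized above: a small down-project/up-project block with a residual connection, trained while the surrounding pretrained model stays frozen. Dimensions and names (Adapter, bottleneck) are illustrative.

```python
# Bottleneck adapter with a residual connection; only these few parameters
# would be trained per task while the pretrained transformer is frozen.
import torch
import torch.nn as nn


class Adapter(nn.Module):
    def __init__(self, d_model: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)
        self.act = nn.GELU()

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # Adapters are often initialized near the identity so the pretrained
        # model's behaviour is preserved at the start of fine-tuning
        # (not enforced in this sketch).
        return h + self.up(self.act(self.down(h)))


adapter = Adapter(d_model=768)
h = torch.randn(2, 16, 768)
print(adapter(h).shape)   # torch.Size([2, 16, 768])
```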
Language Models are Unsupervised Multitask Learners
- Computer Science
- 2019
It is demonstrated that language models begin to learn these tasks without any explicit supervision when trained on a new dataset of millions of webpages called WebText, suggesting a promising path towards building language processing systems which learn to perform tasks from their naturally occurring demonstrations.
Unsupervised Domain Clusters in Pretrained Language Models
- Computer Science, ACL
- 2020
It is shown that massive pre-trained language models implicitly learn sentence representations that cluster by domains without supervision – suggesting a simple data-driven definition of domains in textual data and proposing domain data selection methods based on such models, which require only a small set of in-domain monolingual data.
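A rough sketch of the data-selection idea summarized above, assuming precomputed sentence embeddings (random placeholders here stand in for real PLM representations): cluster a large mixed corpus, locate the cluster a small in-domain seed set falls into, and keep the corpus sentences assigned to it.

```python
# Illustrative domain-cluster data selection with k-means over sentence
# embeddings; the embeddings are random placeholders, not real PLM outputs.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
corpus_emb = rng.normal(size=(1000, 32))   # embeddings of a large mixed corpus
seed_emb = rng.normal(size=(20, 32))       # embeddings of in-domain seed sentences

# Cluster the corpus, find the cluster the seed data mostly maps to,
# and keep the corpus sentences assigned to that cluster.
kmeans = KMeans(n_clusters=8, n_init=10, random_state=0).fit(corpus_emb)
seed_clusters = kmeans.predict(seed_emb)
target_cluster = np.bincount(seed_clusters).argmax()
selected_idx = np.where(kmeans.labels_ == target_cluster)[0]
print(f"selected {len(selected_idx)} candidate in-domain sentences")
```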
Multi-Domain Neural Machine Translation with Word-Level Domain Context Discrimination
- Computer Science, EMNLP
- 2018
This paper jointly models NMT with monolingual attention-based domain classification tasks, distinguishing and exploiting word-level domain contexts to improve multi-domain NMT.