Corpus ID: 236976189

DEMix Layers: Disentangling Domains for Modular Language Modeling

@article{Gururangan2021DEMixLD,
  title={DEMix Layers: Disentangling Domains for Modular Language Modeling},
  author={Suchin Gururangan and Mike Lewis and Ari Holtzman and Noah A. Smith and Luke Zettlemoyer},
  journal={ArXiv},
  year={2021},
  volume={abs/2108.05036}
}
We introduce a new domain expert mixture (DEMix) layer that enables conditioning a language model (LM) on the domain of the input text. A DEMix layer includes a collection of expert feedforward networks, each specialized to a domain, that makes the LM modular: experts can be mixed, added, or removed after initial training. Extensive experiments with autoregressive transformer LMs (up to 1.3B parameters) show that DEMix layers reduce test-time perplexity (especially for out-of-domain data…
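The core architectural idea in the abstract, one feedforward expert per domain with experts that can be added or removed after initial training, can be sketched roughly as follows in PyTorch. The class and method names (DEMixFeedForward, add_expert) are illustrative assumptions, not the authors' released code, and the paper's inference-time mixing of experts is only indicated in a comment.

```python
# Minimal sketch of a DEMix-style feedforward layer (assumed PyTorch API;
# names are illustrative, not the paper's released implementation).
import torch
import torch.nn as nn

class DEMixFeedForward(nn.Module):
    def __init__(self, d_model: int, d_hidden: int, num_domains: int):
        super().__init__()
        # One expert feedforward network per training domain.
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, d_hidden),
                nn.GELU(),
                nn.Linear(d_hidden, d_model),
            )
            for _ in range(num_domains)
        )

    def forward(self, hidden_states: torch.Tensor, domain_id: int) -> torch.Tensor:
        # During training, the domain label routes the batch to a single expert.
        # At test time, the paper mixes experts' outputs (e.g. weighted by an
        # estimated posterior over domains); that mixing is omitted here.
        return self.experts[domain_id](hidden_states)

    def add_expert(self) -> int:
        # Modularity: append a new domain expert after initial training
        # without modifying the existing experts' parameters.
        d_model = self.experts[0][0].in_features
        d_hidden = self.experts[0][0].out_features
        self.experts.append(
            nn.Sequential(
                nn.Linear(d_model, d_hidden),
                nn.GELU(),
                nn.Linear(d_hidden, d_model),
            )
        )
        return len(self.experts) - 1
```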
KALA: Knowledge-Augmented Language Model Adaptation
TLDR
A novel domain adaptation framework for PLMs, coined Knowledge-Augmented Language model Adaptation (KALA), is proposed, which modulates the intermediate hidden representations of PLMs with domain knowledge consisting of entities and their relational facts.
ST-MoE: Designing Stable and Transferable Sparse Expert Models
TLDR
This work concludes by scaling a sparse model to 269B parameters, with a computational cost comparable to a 32B dense encoder-decoder Transformer (Stable and Transferable Mixture-of-Experts, or ST-MoE-32B), and achieves state-of-the-art performance in transfer learning.
Lifelong Pretraining: Continually Adapting Language Models to Emerging Corpora
TLDR
The authors' experiments show distillation-based approaches to be most effective at retaining downstream performance on earlier domains, improving knowledge transfer so that models achieve better downstream performance on the latest data, and improving temporal generalization when distribution gaps exist between training and evaluation due to time.
Domain Generalisation of NMT: Fusing Adapters with Leave-One-Domain-Out Training
TLDR
This paper proposes a fusion-based generalisation method that learns to combine domain-specific parameters, together with a leave-one-domain-out training strategy that avoids information leakage, to address the challenge of not knowing the test domain during training time.
Designing Effective Sparse Expert Models
TLDR
This work concludes by scaling a sparse model to 269B parameters, with a computational cost comparable to a 32B dense encoder-decoder Transformer (Stable and Transferable Mixture-of-Experts, or ST-MoE-32B), and achieves state-of-the-art performance in transfer learning.
ELLE: Efficient Lifelong Pre-training for Emerging Data
TLDR
The proposed ELLE consists of function preserved model expansion, which flexibly expands an existing PLM’s width and depth to improve the efficiency of knowledge acquisition, and pre-trained domain prompts, which disentangle the versatile knowledge learned during pre-training and stimulate the proper knowledge for downstream tasks.
Time Waits for No One! Analysis and Challenges of Temporal Misalignment
TLDR
This work establishes a suite of tasks across multiple domains to study temporal misalignment in modern NLP systems and concludes that, while temporal adaptation through continued pretraining can help, these gains are small compared to task-specific finetuning on data from the target time period.
Unified Modeling of Multi-Domain Multi-Device ASR Systems
TLDR
Experiments show that the proposed unified modeling approach actually outperforms the carefully tuned per-domain models, giving relative gains of up to 10% over a baseline model with negligible increase in the number of parameters.
Adapting to the Long Tail: A Meta-Analysis of Transfer Learning Research for Language Understanding Tasks
TLDR
This work reflects on whether transfer learning methods have sufficiently addressed the performance of benchmark-trained models on the long tail, and assesses trends in transfer learning research through a qualitative meta-analysis of 100 representative papers on transfer learning for NLU.
Vocal markers of autism: assessing the generalizability of machine learning models
TLDR
This paper systematically assesses the generalizability of ML models of vocal markers - and more generally biobehavioral markers - of autism across a variety of contexts, finding that they do not generalize well to different, though similar, tasks and not at all to new languages.

References

Showing 1-10 of 75 references
Plug and Play Language Models: A Simple Approach to Controlled Text Generation
TLDR
The Plug and Play Language Model (PPLM) for controllable language generation is proposed, which combines a pretrained LM with one or more simple attribute classifiers that guide text generation without any further training of the LM.
Multidomain Pretrained Language Models for Green NLP
TLDR
This paper shows that domain adaptation can be generalised to cover multiple domains and a single model can be trained across various domains at the same time with minimal drop in performance, even when the authors use less data and resources.
LAMOL: LAnguage MOdeling for Lifelong Language Learning
TLDR
The results show that LAMOL prevents catastrophic forgetting without any sign of intransigence and can perform five very different language tasks sequentially with only one model.
Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer
TLDR
This work introduces a Sparsely-Gated Mixture-of-Experts layer (MoE), consisting of up to thousands of feed-forward sub-networks, and applies the MoE to the tasks of language modeling and machine translation, where model capacity is critical for absorbing the vast quantities of knowledge available in the training corpora.
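For contrast with DEMix's domain-labeled routing, the snippet below sketches the learned, sparsely-gated top-k routing used in this Mixture-of-Experts line of work. It is an illustrative approximation under assumed PyTorch APIs; the TopKGate name is ours, not the paper's code.

```python
# Hedged sketch of sparsely-gated top-k expert routing (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKGate(nn.Module):
    def __init__(self, d_model: int, num_experts: int, k: int = 2):
        super().__init__()
        # A learned linear gate scores every expert for every token.
        self.w_gate = nn.Linear(d_model, num_experts, bias=False)
        self.k = k

    def forward(self, x: torch.Tensor):
        # Keep only the top-k experts per token, so most experts are skipped
        # and per-token compute stays roughly constant as experts are added.
        logits = self.w_gate(x)                         # (tokens, num_experts)
        topk_vals, topk_idx = logits.topk(self.k, dim=-1)
        weights = F.softmax(topk_vals, dim=-1)          # renormalize over chosen experts
        return weights, topk_idx                        # routing weights and expert ids
```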
DExperts: Decoding-Time Controlled Text Generation with Experts and Anti-Experts
TLDR
This work highlights the promise of tuning small LMs on text with (un)desirable attributes for efficient decoding-time steering and applies DExperts to language detoxification and sentiment-controlled generation, where it outperforms existing controllable generation methods on both automatic and human evaluations.
CTRL: A Conditional Transformer Language Model for Controllable Generation
TLDR
CTRL is released, a 1.63 billion-parameter conditional transformer language model, trained to condition on control codes that govern style, content, and task-specific behavior, providing more explicit control over text generation.
Parameter-Efficient Transfer Learning for NLP
TLDR
To demonstrate adapters' effectiveness, the recently proposed BERT Transformer model is transferred to 26 diverse text classification tasks, including the GLUE benchmark, and adapters attain near state-of-the-art performance whilst adding only a few parameters per task.
Language Models are Unsupervised Multitask Learners
TLDR
It is demonstrated that language models begin to learn these tasks without any explicit supervision when trained on a new dataset of millions of webpages called WebText, suggesting a promising path towards building language processing systems which learn to perform tasks from their naturally occurring demonstrations.
Unsupervised Domain Clusters in Pretrained Language Models
TLDR
It is shown that massive pre-trained language models implicitly learn sentence representations that cluster by domains without supervision – suggesting a simple data-driven definition of domains in textual data and proposing domain data selection methods based on such models, which require only a small set of in-domain monolingual data.
Multi-Domain Neural Machine Translation with Word-Level Domain Context Discrimination
TLDR
This paper jointly models NMT with monolingual attention-based domain classification tasks, improving NMT by distinguishing and exploiting word-level domain contexts for multi-domain translation.