Branch-Train-Merge: Embarrassingly Parallel Training of Expert Language Models

Authors: Margaret Li, Suchin Gururangan, Tim Dettmers, Mike Lewis, Tim Althoff, Noah A. Smith, Luke Zettlemoyer
We present Branch-Train-Merge (BTM), a communication-efficient algorithm for embarrassingly parallel training of large language models (LLMs). We show it is possible to independently train subparts of a new class of LLMs on different subsets of the data, eliminating the massive multi-node synchronization currently required to train LLMs. BTM learns a set of independent Expert LMs (ELMs), each specialized to a different textual domain, such as scientific or legal text. These ELMs can be added and…

Domain-Specific Text Generation for Machine Translation

This work proposes leveraging state-of-the-art pretrained language models (LMs) for domain-specific data augmentation for MT, simulating the domain characteristics of either a small bilingual dataset, or the monolingual source text to be translated, to generate huge amounts of synthetic bilingual in-domain data.

Lo-fi: Distributed Fine-tuning without Communication

By removing the communication requirement, lo-fi reduces resource barriers for fine-tuning large models and enables fine-tuning in settings with prohibitive communication cost.

A Review of Sparse Expert Models in Deep Learning

The concept of sparse expert models is reviewed, a basic description of the common algorithms is provided, the advances in the deep learning era are contextualized, and areas for future work are highlighted.

Models with Conditional Computation Learn Suboptimal Solutions

It is demonstrated that supervising the routing decision on a small fraction of the examples is sufficient to help the model learn better routing strategies. The results shed light on the difficulties of learning effective routing and set the stage for future work on conditional computation mechanisms and training techniques.

ERNIE-ViLG 2.0: Improving Text-to-Image Diffusion Model with Knowledge-Enhanced Mixture-of-Denoising-Experts

ERNIE-ViLG 2.0 is proposed, a large-scale Chinese text-to-image diffusion model that progressively upgrades the quality of generated images by incorporating fine-grained textual and visual knowledge of key elements in the scene and by utilizing different denoising experts at different denoising stages.



Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

This systematic study compares pre-training objectives, architectures, unlabeled datasets, transfer approaches, and other factors on dozens of language understanding tasks and achieves state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more.

Lifting the Curse of Multilinguality by Pre-training Modular Transformers

This work introduces language-specific modules into Cross-lingual Modular models from the start of pre-training, which allows the total capacity of the model to grow while keeping the total number of trainable parameters per language constant.

DEMix Layers: Disentangling Domains for Modular Language Modeling

A new domain expert mixture (DEMix) layer is introduced that enables conditioning a language model (LM) on the domain of the input text. The work shows it is possible to add experts to adapt to new domains without forgetting older ones, and to remove experts to restrict access to unwanted domains.

Beyond Distillation: Task-level Mixture-of-Experts for Efficient Inference

This work investigates routing strategies at different granularity (token, sentence, task) in MoE models to bypass distillation and suggests that task-level routing (task-MoE) enables us to extract smaller, ready-to-deploy sub-networks from large sparse models.

Plug and Play Language Models: A Simple Approach to Controlled Text Generation

The Plug and Play Language Model (PPLM) for controllable language generation is proposed, which combines a pretrained LM with one or more simple attribute classifiers that guide text generation without any further training of the LM.

Language Models are Unsupervised Multitask Learners

It is demonstrated that language models begin to learn these tasks without any explicit supervision when trained on a new dataset of millions of webpages called WebText, suggesting a promising path towards building language processing systems which learn to perform tasks from their naturally occurring demonstrations.

The Pile: An 800GB Dataset of Diverse Text for Language Modeling

This work presents the Pile, an 825 GiB English text corpus targeted at training large-scale language models, constructed from 22 diverse high-quality subsets (both existing and newly constructed), many of which derive from academic or professional sources.

Efficient Large Scale Language Modeling with Mixtures of Experts

This paper presents a detailed empirical study of how autoregressive MoE language models scale in comparison with dense models in a wide range of settings: in- and out-of-domain language modeling, zero- and few-shot priming, and full-shot tuning.

MAD-X: An Adapter-based Framework for Multi-task Cross-lingual Transfer

MAD-X is proposed, an adapter-based framework that enables high portability and parameter-efficient transfer to arbitrary tasks and languages by learning modular language and task representations. The work also introduces a novel invertible adapter architecture and a strong baseline method for adapting a pretrained multilingual model to a new language.

A Simple Method for Commonsense Reasoning

Key to this method is the use of language models, trained on a massive amount of unlabeled data, to score multiple-choice questions posed by commonsense reasoning tests; the method outperforms previous state-of-the-art methods by a large margin.