Branch-Train-Merge: Embarrassingly Parallel Training of Expert Language Models

Margaret Li, Suchin Gururangan, Tim Dettmers, Mike Lewis, Tim Althoff, Noah A. Smith, Luke Zettlemoyer
We present Branch-Train-Merge (BTM), a communication-efficient algorithm for embarrassingly parallel training of large language models (LLMs). We show it is possible to independently train subparts of a new class of LLMs on different subsets of the data, eliminating the massive multi-node synchronization currently required to train LLMs. BTM learns a set of independent Expert LMs (ELMs), each specialized to a different textual domain, such as scientific or legal text. These ELMs can be added and…
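As a rough illustration of the embarrassingly parallel idea in the abstract, the sketch below fits toy unigram "experts" independently, one per domain corpus, with no state shared between training calls. Everything here is an assumption made for illustration: the unigram model, the helper names `train_expert`, `expert_prob`, and `mixture_prob`, the toy corpora, and the uniform mixture weights. The truncated abstract does not say how the experts are combined, so the weighted average is only one plausible choice and not the authors' method.

```python
from collections import Counter


def train_expert(corpus: str) -> Counter:
    """Fit a toy unigram model on a single domain's text.

    Each call sees only its own corpus, so calls could run on separate
    machines with no synchronization between them.
    """
    return Counter(corpus.split())


def expert_prob(counts: Counter, token: str, vocab_size: int) -> float:
    """Add-one smoothed unigram probability of `token` under one expert."""
    return (counts[token] + 1) / (sum(counts.values()) + vocab_size)


def mixture_prob(experts: dict, weights: dict, token: str, vocab_size: int) -> float:
    """Weighted average of expert probabilities (an assumed combination rule)."""
    return sum(
        weights[name] * expert_prob(counts, token, vocab_size)
        for name, counts in experts.items()
    )


# Hypothetical toy corpora standing in for two textual domains.
domains = {
    "scientific": "the enzyme binds the substrate and the reaction proceeds",
    "legal": "the court finds the defendant liable and the appeal fails",
}
vocab = sorted({tok for text in domains.values() for tok in text.split()})

# Independent training: no expert ever reads another domain's data.
experts = {name: train_expert(text) for name, text in domains.items()}

# Uniform weights are an assumption; any non-negative weights summing to 1 work.
weights = {name: 1 / len(experts) for name in experts}

for token in ("enzyme", "court"):
    print(token, round(mixture_prob(experts, weights, token, len(vocab)), 4))
```

Because each expert is just an entry in a dictionary, adding or dropping a domain in this toy setup means adding or deleting a key and renormalizing the weights; no other expert needs retraining.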

Domain-Specific Text Generation for Machine Translation

This work proposes leveraging state-of-the-art pretrained language models (LMs) for domain-specific data augmentation for MT, simulating the domain characteristics of either a small bilingual dataset, or the monolingual source text to be translated, to generate huge amounts of synthetic bilingual in-domain data.

Lo-fi: Distributed Fine-tuning without Communication

By removing the communication requirement, lo-fi reduces resource barriers for fine-tuning large models and enables fine-tuning in settings with prohibitive communication cost.

A Review of Sparse Expert Models in Deep Learning

The concept of sparse expert models is reviewed, a basic description of the common algorithms is provided, the advances in the deep learning era are contextualized, and areas for future work are highlighted.

Models with Conditional Computation Learn Suboptimal Solutions

It is demonstrated that supervising the routing decision on a small fraction of the examples is sufficient to help the model learn better routing strategies. The results shed light on the difficulties of learning effective routing and set the stage for future work on conditional computation mechanisms and training techniques.

ERNIE-ViLG 2.0: Improving Text-to-Image Diffusion Model with Knowledge-Enhanced Mixture-of-Denoising-Experts

ERNIE-ViLG 2.0 is proposed, a large-scale Chinese text-to-image diffusion model, which progressively upgrades the quality of generated images by incorporating fine-grained textual and visual knowledge of key elements in the scene, and utilizing different denoising experts at different denoising stages.

Pre-train, fine-tune, interpolate: a three-stage strategy for domain generalization

The goal of domain generalization is to train models that generalize well to unseen domains. This work interpolates the featurizer with auxiliary featurizers trained on auxiliary datasets, which improves the performance of existing state-of-the-art models on the DomainBed benchmark.
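To make the interpolation step concrete, here is a minimal sketch of linear weight interpolation between a main featurizer and a set of auxiliary featurizers. The representation of a featurizer as a dictionary of flat parameter lists, the function name `interpolate`, the single coefficient `lam`, and the choice to average the auxiliaries before mixing are all assumptions for illustration; the summary above does not specify the exact scheme the paper uses.

```python
def interpolate(main: dict, auxiliaries: list, lam: float) -> dict:
    """Return (1 - lam) * main + lam * mean(auxiliaries), parameter by parameter.

    `main` and every entry of `auxiliaries` map parameter names to equal-length
    lists of floats (a hypothetical stand-in for real weight tensors).
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    if not auxiliaries:
        raise ValueError("need at least one auxiliary featurizer")

    merged = {}
    for name, values in main.items():
        merged[name] = [
            (1 - lam) * v
            + lam * sum(aux[name][i] for aux in auxiliaries) / len(auxiliaries)
            for i, v in enumerate(values)
        ]
    return merged


# Toy weights; real featurizers would hold many tensors per layer.
main_featurizer = {"w": [1.0, 2.0]}
aux_featurizers = [{"w": [3.0, 4.0]}, {"w": [5.0, 6.0]}]

merged = interpolate(main_featurizer, aux_featurizers, lam=0.5)
print(merged)
```

Setting `lam=0.0` returns the main featurizer unchanged, which gives a simple sanity check when tuning the coefficient.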

ColD Fusion: Collaborative Descent for Distributed Multitask Finetuning

It is shown that ColD Fusion yields comparable benefits to multitask training by producing a model that attains strong performance on all of the datasets it was multitask trained on and is a better starting point for finetuning on unseen datasets.

Efficient Hierarchical Domain Adaptation for Pretrained Language Models

This paper introduces a method to permit domain adaptation to many diverse domains using a computationally efficient adapter approach based on the observation that textual domains are partially overlapping, and represents domains as a hierarchical tree structure where each node in the tree is associated with a set of adapter weights.

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

This systematic study compares pre-training objectives, architectures, unlabeled datasets, transfer approaches, and other factors on dozens of language understanding tasks and achieves state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more.

Lifting the Curse of Multilinguality by Pre-training Modular Transformers

This work introduces language-specific modules into Cross-lingual Modular models from the start, which allows the total capacity of the model to grow while keeping the total number of trainable parameters per language constant.

DEMix Layers: Disentangling Domains for Modular Language Modeling

This work introduces a new domain expert mixture (DEMix) layer that enables conditioning a language model (LM) on the domain of the input text, and shows that experts can be added to adapt to new domains without forgetting older ones and removed to restrict access to unwanted domains.

Beyond Distillation: Task-level Mixture-of-Experts for Efficient Inference

This work investigates routing strategies at different granularity (token, sentence, task) in MoE models to bypass distillation and suggests that task-level routing (task-MoE) enables us to extract smaller, ready-to-deploy sub-networks from large sparse models.

Plug and Play Language Models: A Simple Approach to Controlled Text Generation

The Plug and Play Language Model (PPLM) for controllable language generation is proposed, which combines a pretrained LM with one or more simple attribute classifiers that guide text generation without any further training of the LM.

Language Models are Unsupervised Multitask Learners

It is demonstrated that language models begin to learn natural language processing tasks, such as question answering, machine translation, reading comprehension, and summarization, without any explicit supervision when trained on a new dataset of millions of webpages called WebText, suggesting a promising path towards building language processing systems which learn to perform tasks from their naturally occurring demonstrations.

The Pile: An 800GB Dataset of Diverse Text for Language Modeling

This work presents the Pile, an 825 GiB English text corpus targeted at training large-scale language models, constructed from 22 diverse high-quality subsets (both existing and newly constructed), many of which derive from academic or professional sources.

Efficient Large Scale Language Modeling with Mixtures of Experts

This paper presents a detailed empirical study of how autoregressive MoE language models scale in comparison with dense models in a wide range of settings: in- and out-of-domain language modeling, zero- and few-shot priming, and full-shot tuning.

MAD-X: An Adapter-based Framework for Multi-task Cross-lingual Transfer

MAD-X is proposed, an adapter-based framework that enables high portability and parameter-efficient transfer to arbitrary tasks and languages by learning modular language and task representations. The work also introduces a novel invertible adapter architecture and a strong baseline method for adapting a pretrained multilingual model to a new language.