Corpus ID: 246411325

Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model

@article{Smith2022UsingDA,
  title={Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model},
  author={Shaden Smith and Mostofa Ali Patwary and Brandon Norick and Patrick LeGresley and Samyam Rajbhandari and Jared Casper and Zhun Liu and Shrimai Prabhumoye and George Zerveas and Vijay Anand Korthikanti and Elton Zhang and Rewon Child and Reza Yazdani Aminabadi and Julie Bernauer and Xia Song and Mohammad Shoeybi and Yuxiong He and Michael Houston and Saurabh Tiwary and Bryan Catanzaro},
  journal={ArXiv},
  year={2022},
  volume={abs/2201.11990}
}
Pretrained general-purpose language models can achieve state-of-the-art accuracies in various natural language processing domains by adapting to downstream tasks via zero-shot, few-shot and fine-tuning techniques. Because of their success, the size of these models has increased rapidly, requiring high-performance hardware, software, and algorithmic techniques to enable training such large models. As the result of a joint effort between Microsoft and NVIDIA, we present details on the training of… 
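As a rough check on the headline parameter count, the standard approximation P ≈ 12·l·h² for a decoder-only transformer (ignoring embeddings) can be evaluated against the architecture the paper reports (105 layers, hidden size 20480). The sketch below is a back-of-the-envelope illustration; the padded vocabulary size used for the embedding term is an assumption for illustration, not a value taken from the paper.

```python
# Minimal sketch: back-of-the-envelope parameter count for a 530B-scale
# decoder-only transformer. Layer count and hidden size follow the paper's
# reported architecture; the padded vocabulary size is an assumption.

num_layers = 105          # transformer decoder layers
hidden_size = 20480       # model (embedding) dimension
vocab_size = 51200        # assumed padded vocabulary for the embedding matrix

# Each layer holds ~4*h^2 attention weights and ~8*h^2 MLP weights -> ~12*h^2.
transformer_params = 12 * num_layers * hidden_size ** 2
embedding_params = vocab_size * hidden_size

total = transformer_params + embedding_params
print(f"~{total / 1e9:.0f}B parameters")  # ~530B
```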
PaLM: Scaling Language Modeling with Pathways
TLDR
A 540-billion-parameter, densely activated Transformer language model called PaLM achieves breakthrough performance, outperforming the state of the art on a suite of multi-step reasoning tasks and exceeding average human performance on the recently released BIG-bench benchmark.
Predictability and Surprise in Large Generative Models
TLDR
This paper highlights a counterintuitive property of large-scale generative models: a paradoxical combination of predictable loss on a broad training distribution and unpredictable specific capabilities, inputs, and outputs. It analyzes how these conflicting properties give model developers various motivations for deploying these models, as well as challenges that can hinder deployment.
Reducing Activation Recomputation in Large Transformer Models
TLDR
This work presents two novel yet very simple techniques, sequence parallelism and selective activation recomputation, which, in conjunction with tensor parallelism, almost eliminate the need to recompute activations.
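For readers unfamiliar with activation recomputation, the sketch below uses PyTorch's `torch.utils.checkpoint` to recompute only the attention sublayer during the backward pass while storing MLP activations as usual. It is a generic illustration of selective checkpointing under assumed module names and sizes, not the sequence-parallel implementation described in that paper.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class SelectiveCheckpointBlock(nn.Module):
    """Toy transformer block that recomputes only the attention part."""

    def __init__(self, hidden: int = 256, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(hidden)
        self.norm2 = nn.LayerNorm(hidden)
        self.mlp = nn.Sequential(nn.Linear(hidden, 4 * hidden), nn.GELU(),
                                 nn.Linear(4 * hidden, hidden))

    def _attn(self, x):
        out, _ = self.attn(x, x, x, need_weights=False)
        return out

    def forward(self, x):
        # Attention activations are discarded and recomputed in backward,
        # trading extra FLOPs for lower activation memory.
        x = x + checkpoint(self._attn, self.norm1(x), use_reentrant=False)
        # MLP activations are stored as usual (no recomputation).
        x = x + self.mlp(self.norm2(x))
        return x

block = SelectiveCheckpointBlock()
y = block(torch.randn(2, 16, 256, requires_grad=True))
y.sum().backward()
```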
METRO: Efficient Denoising Pretraining of Large Scale Autoencoding Language Models with Model Generated Signals
TLDR
This work conducts a comprehensive empirical study, and proposes a recipe, namely “Model generated dEnoising TRaining Objective” (METRO), which incorporates some of the best modeling techniques developed recently to speed up, stabilize, and enhance pretrained language models without compromising model effectiveness.
Exploring the Limits of Domain-Adaptive Training for Detoxifying Large-Scale Language Models
TLDR
This work systematically explores domain-adaptive training to reduce the toxicity of language models and demonstrates that adding and training adapter-only layers in LMs not only saves many parameters but also achieves a better trade-off between toxicity and perplexity than whole-model adaptation for large-scale models.
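The "adapter-only layers" mentioned above are small bottleneck modules inserted into an otherwise frozen LM so that only a tiny fraction of parameters is trained. The sketch below is a generic residual bottleneck adapter with assumed sizes, shown only to make the parameter saving concrete; it is not the specific detoxification setup of that paper.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Small residual bottleneck inserted after a frozen transformer sublayer."""

    def __init__(self, hidden: int = 1024, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden, bottleneck)
        self.up = nn.Linear(bottleneck, hidden)
        self.act = nn.GELU()

    def forward(self, x):
        # Residual connection keeps the frozen model's behaviour as a baseline.
        return x + self.up(self.act(self.down(x)))

hidden = 1024
adapter = BottleneckAdapter(hidden)
frozen_per_layer = 12 * hidden ** 2      # rough per-layer cost of the base LM
adapter_params = sum(p.numel() for p in adapter.parameters())
print(f"adapter trains ~{100 * adapter_params / frozen_per_layer:.1f}% "
      f"of one layer's parameters")
```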
P2P: Tuning Pre-trained Image Models for Point Cloud Analysis with Point-to-Pixel Prompting
TLDR
This paper tunes pre-trained image models with a novel Point-to-Pixel prompting for point cloud analysis at a minor parameter cost, achieving 89.3% accuracy on the hardest setting of ScanObjectNN and surpassing conventional point cloud models with far fewer trainable parameters.
giMLPs: Gate with Inhibition Mechanism in MLPs
TLDR
The gate with inhibition applied to CycleMLP (gi-CycleMLP) matches its performance on the ImageNet classification task, and it also improves the BERT, RoBERTa, and DeBERTaV3 models by means of two novel techniques.
Training Compute-Optimal Large Language Models
TLDR
This paper trains a predicted compute-optimal model, Chinchilla, that uses the same compute budget as Gopher but with 70B parameters and 4× more data, and reaches state-of-the-art average accuracy on the MMLU benchmark.
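The "4× more data at a quarter of the parameters" trade-off follows directly from the common training-compute approximation C ≈ 6·N·D (N parameters, D training tokens): quartering N and quadrupling D leaves C unchanged. The sketch below illustrates this with round numbers chosen for illustration, not the exact token counts used for Gopher or Chinchilla.

```python
# Training-compute approximation: C ~= 6 * N * D FLOPs
# (N = parameter count, D = training tokens).

def train_flops(params: float, tokens: float) -> float:
    return 6 * params * tokens

gopher_like = train_flops(280e9, 300e9)       # large model, fewer tokens
chinchilla_like = train_flops(70e9, 1.2e12)   # 4x smaller, 4x more tokens

print(f"{gopher_like:.2e} vs {chinchilla_like:.2e}")  # identical budgets
assert gopher_like == chinchilla_like
```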
Dive into Big Model Training
TLDR
This report explores what big model training is and how it works by diving into training objectives and training methodologies, and it summarizes the existing training methodologies into three main categories: training parallelism, memory-saving technologies, and model sparsity design.
Grounding Language Models on External Sources: A Comparison of Knowledge Graphs and Unstructured Text
TLDR
Two approaches to grounding language models on external knowledge are compared: one uses a knowledge graph and the other uses unstructured document collections. The study investigates how the choice of knowledge representation affects (A) model architecture, (B) ease of training, and (C) model performance on knowledge-grounded dialogue.
...

References

Showing 1-10 of 74 references
RACE: Large-scale ReAding Comprehension Dataset From Examinations
TLDR
The proportion of questions that require reasoning is much larger in RACE than in other benchmark datasets for reading comprehension, and there is a significant gap between the performance of state-of-the-art models and ceiling human performance.
Language Models are Few-Shot Learners
TLDR
GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic.
Language Models are Unsupervised Multitask Learners
TLDR
It is demonstrated that language models begin to learn these tasks without any explicit supervision when trained on a new dataset of millions of webpages called WebText, suggesting a promising path towards building language processing systems which learn to perform tasks from their naturally occurring demonstrations.
The Pile: An 800GB Dataset of Diverse Text for Language Modeling
TLDR
This work presents the Pile, an 825 GiB English text corpus targeted at training large-scale language models, constructed from 22 diverse high-quality subsets (both existing and newly constructed), many of which derive from academic or professional sources.
Adversarial NLI: A New Benchmark for Natural Language Understanding
TLDR
This work introduces a new large-scale NLI benchmark dataset, collected via an iterative, adversarial human-and-model-in-the-loop procedure, and shows that non-expert annotators are successful at finding the models' weaknesses.
Right for the Wrong Reasons: Diagnosing Syntactic Heuristics in Natural Language Inference
TLDR
There is substantial room for improvement in NLI systems, and the HANS dataset, which contains many examples where these heuristics fail, can motivate and measure progress in this area.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
TLDR
A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
TLDR
A simple, efficient intra-layer model parallel approach that enables training transformer models with billions of parameters and shows that careful attention to the placement of layer normalization in BERT-like models is critical to achieving increased performance as the model size grows.
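The intra-layer (tensor) parallelism described above splits each MLP's first weight matrix by columns and the second by rows, so each device computes an independent slice and only a single reduction is needed. The sketch below simulates two such partitions on one process with plain tensors to show that the partitioned result matches the unpartitioned MLP; it is an illustration of the idea, not the library's distributed implementation.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
hidden, ffn, devices = 64, 256, 2

x = torch.randn(8, hidden)
A = torch.randn(hidden, ffn)   # first MLP weight (split by columns)
B = torch.randn(ffn, hidden)   # second MLP weight (split by rows)

# Unpartitioned reference: Z = GeLU(X A) B
reference = F.gelu(x @ A) @ B

# Column-split A and row-split B across "devices"; GeLU can be applied
# independently per shard because the column split keeps it elementwise-local.
A_shards = A.chunk(devices, dim=1)
B_shards = B.chunk(devices, dim=0)
partials = [F.gelu(x @ a) @ b for a, b in zip(A_shards, B_shards)]

# The only communication needed is a single all-reduce (sum) of the partials.
combined = torch.stack(partials).sum(dim=0)
print(torch.allclose(reference, combined, atol=1e-5))  # True
```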
Scaling Language Models: Methods, Analysis & Insights from Training Gopher
TLDR
This paper presents an analysis of Transformer-based language model performance across a wide range of model scales, from models with tens of millions of parameters up to a 280-billion-parameter model called Gopher.
A framework for few-shot language model evaluation, September 2021
...