• Corpus ID: 246411325

Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model

  title={Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model},
  author={Shaden Smith and Mostofa Ali Patwary and Brandon Norick and Patrick LeGresley and Samyam Rajbhandari and Jared Casper and Zhun Liu and Shrimai Prabhumoye and George Zerveas and Vijay Anand Korthikanti and Elton Zhang and Rewon Child and Reza Yazdani Aminabadi and Julie Bernauer and Xia Song and Mohammad Shoeybi and Yuxiong He and Michael Houston and Saurabh Tiwary and Bryan Catanzaro},
Pretrained general-purpose language models can achieve state-of-the-art accuracies in various natural language processing domains by adapting to downstream tasks via zero-shot, few-shot and fine-tuning techniques. Because of their success, the size of these models has increased rapidly, requiring high-performance hardware, software, and algorithmic techniques to enable training such large models. As the result of a joint effort between Microsoft and NVIDIA, we present details on the training of… 

PaLM: Scaling Language Modeling with Pathways

A 540-billion parameter, densely activated, Transformer language model, which is called PaLM achieves breakthrough performance, outperforming the state-of-the-art on a suite of multi-step reasoning tasks, and outperforming average human performance on the recently released BIG-bench benchmark.

Predictability and Surprise in Large Generative Models

This paper highlights a counterintuitive property of large-scale generative models, which have a paradoxical combination of predictable loss on a broad training distribution, and unpredictable specific capabilities, inputs, and outputs, and analyzed how these conflicting properties combine to give model developers various motivations for deploying these models, and challenges that can hinder deployment.

Reducing Activation Recomputation in Large Transformer Models

This work presents two novel yet very simple techniques: sequence parallelism and selective activation recomputation, which almost eliminate the need to recompute activations in conjunction with tensor parallelism.

METRO: Efficient Denoising Pretraining of Large Scale Autoencoding Language Models with Model Generated Signals

This work conducts a comprehensive empirical study, and proposes a recipe, namely “Model generated dEnoising TRaining Objective” (METRO), which incorporates some of the best modeling techniques developed recently to speed up, stabilize, and enhance pretrained language models without compromising model effectiveness.

Exploring the Limits of Domain-Adaptive Training for Detoxifying Large-Scale Language Models

This work systematically explore domain-adaptive training to reduce the toxicity of language models and demonstrates that adding and training adapter-only layers in LMs not only saves a lot of parameters but also achieves a better trade-off between toxicity and perplexity than whole model adaptation for the large-scale models.

Adam Can Converge Without Any Modification on Update Rules

The result shows that Adam can converge under a wide range of hyperparameters without any modification on its update rules, and is the first to prove this result without strong assumptions such as bounded gradients.

Orca: A Distributed Serving System for Transformer-Based Generative Models

  • 2022

P2P: Tuning Pre-trained Image Models for Point Cloud Analysis with Point-to-Pixel Prompting

This paper tuning pre-trained image models with the novel Point-to-Pixel prompting for point cloud analysis at a minor parameter cost achieves 89.3% accuracy on the hardest setting of ScanObjectNN, surpassing conventional point cloud models with much fewer trainable parameters.

giMLPs: Gate with Inhibition Mechanism in MLPs

The gate with inhibition on CycleMLP (gi-CycleMLP) can produce the equal performance on ImageNet classification task, and it also improves the BERT, RoBERTa and DeBERTaV3 models depending on two novel techniques.

Training Compute-Optimal Large Language Models

This paper trains a predicted compute-optimal model, Chinchilla, that uses the same compute budget as Gopher but with 70B parameters and 4 × more more data, and reaches a state-of-the-art average accuracy on the MMLU benchmark.



RACE: Large-scale ReAding Comprehension Dataset From Examinations

The proportion of questions that requires reasoning is much larger in RACE than that in other benchmark datasets for reading comprehension, and there is a significant gap between the performance of the state-of-the-art models and the ceiling human performance.

Language Models are Few-Shot Learners

GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic.

Language Models are Unsupervised Multitask Learners

It is demonstrated that language models begin to learn these tasks without any explicit supervision when trained on a new dataset of millions of webpages called WebText, suggesting a promising path towards building language processing systems which learn to perform tasks from their naturally occurring demonstrations.

The Pile: An 800GB Dataset of Diverse Text for Language Modeling

This work presents the Pile, an 825 GiB English text corpus tar-geted at training large-scale language models, constructed from 22 diverse high-quality subsets—both existing and newly constructed—many of which derive from academic or professional sources.

Adversarial NLI: A New Benchmark for Natural Language Understanding

This work introduces a new large-scale NLI benchmark dataset, collected via an iterative, adversarial human-and-model-in-the-loop procedure, and shows that non-expert annotators are successful at finding their weaknesses.

Right for the Wrong Reasons: Diagnosing Syntactic Heuristics in Natural Language Inference

There is substantial room for improvement in NLI systems, and the HANS dataset can motivate and measure progress in this area, which contains many examples where the heuristics fail.

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.

Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

A simple, efficient intra-layer model parallel approach that enables training transformer models with billions of parameters and shows that careful attention to the placement of layer normalization in BERT-like models is critical to achieving increased performance as the model size grows.

Scaling Language Models: Methods, Analysis & Insights from Training Gopher

This paper presents an analysis of Transformer-based language model performance across a wide range of model scales -- from models with tens of millions of parameters up to a 280 billion parameter model called Gopher.

Ethical and social risks of harm from Language Models

This paper aims to help structure the risk landscape associated with large-scale Language Models (LMs) by analyzing a wide range of established and anticipated risks, drawing on multidisciplinary literature from computer science, linguistics, and social sciences.