DeepSpeed Data Efficiency: Improving Deep Learning Model Quality and Training Efficiency via Efficient Data Sampling and Routing

Conglong Li, Zhewei Yao, Xiaoxia Wu, Minjia Zhang, Yuxiong He
Recent advances in deep learning models come at the price of formidable training costs. The increasing model size is one root cause, but another, less-emphasized fact is that data scale is increasing at a similar speed as model scale, and the training cost is proportional to both. Compared to the rapidly evolving model architectures, how to efficiently use the training data (especially for expensive foundation model pretraining) is both less explored and difficult to…

Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity

This work simplifies the MoE routing algorithm, designs intuitive improved models with reduced communication and computational costs, advances the current scale of language models by pre-training models with up to a trillion parameters on the “Colossal Clean Crawled Corpus”, and achieves a 4x speedup over the T5-XXL model.

DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale

DeepSpeed-MoE is presented, an end-to-end MoE training and inference solution as part of the DeepSpeed library, including novel MoE architecture designs and model compression techniques that reduce MoE model size by up to 3.7x, and a highly optimized inference system that provides 7.3x better latency and cost compared to existing MoE inference solutions.

Learned Token Pruning for Transformers

A novel token reduction method dubbed Learned Token Pruning (LTP) adaptively removes unimportant tokens as an input sequence passes through transformer layers, and is more robust than prior methods to variations in input sequence length.
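The core mechanism described above can be illustrated with a minimal NumPy sketch: tokens whose average attention score falls below a per-layer threshold are dropped. The threshold here is a hand-set constant and the importance measure is a simplification; in LTP the thresholds are learned during training, so the function name and shapes below are illustrative assumptions.

```python
import numpy as np

def prune_tokens(hidden_states, attention_probs, threshold):
    """Threshold-based token pruning sketch: drop tokens whose mean
    attention score (averaged over heads and query positions) falls
    below a threshold (learned per layer in LTP; fixed here)."""
    # attention_probs: (heads, seq_len, seq_len); importance of each key token
    importance = attention_probs.mean(axis=(0, 1))  # (seq_len,)
    keep = importance >= threshold
    return hidden_states[keep], keep

# Toy example: 6 tokens, hidden dim 4, 2 attention heads.
rng = np.random.default_rng(0)
h = rng.normal(size=(6, 4))
attn = rng.random(size=(2, 6, 6))
attn /= attn.sum(axis=-1, keepdims=True)  # normalize over key positions
pruned, mask = prune_tokens(h, attn, threshold=1.0 / 6)
```

Because each attention row sums to one, the per-token importances sum to one as well, so a threshold of `1/seq_len` keeps only above-average tokens.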

GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding

GShard enabled scaling a multilingual neural machine translation Transformer model with a Sparsely-Gated Mixture-of-Experts beyond 600 billion parameters via automatic sharding; such a giant model can be trained efficiently on 2048 TPU v3 accelerators in 4 days, achieving far superior quality for translation from 100 languages to English compared to the prior art.

Curriculum Learning: A Regularization Method for Efficient and Stable Billion-Scale GPT Model Pre-Training

This work presents a novel sequence length warmup method that simultaneously improves training stability and efficiency, and exerts a gradient variance reduction effect and regularizes early stages of training where the amount of training data is much smaller than the model capacity.

The Stability-Efficiency Dilemma: Investigating Sequence Length Warmup for Training GPT Models

A simple yet effective Sequence Length Warmup method that solves the training stability-efficiency dilemma by avoiding extreme gradient variance values, together with a lightweight tuning strategy that tunes the method using only a small portion of the expensive full training run.
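A minimal sketch of the idea, assuming a linear schedule: grow the training sequence length from a short starting length to the full length over a fixed number of warmup steps, rounded to a hardware-friendly multiple. The function name, default lengths, and the linear shape of the schedule are illustrative assumptions, not the paper's exact recipe.

```python
def seqlen_warmup(step, warmup_steps, start_len=64, full_len=2048, multiple=8):
    """Linearly grow the training sequence length from start_len to
    full_len over warmup_steps, rounded down to a multiple for
    hardware efficiency; short early sequences keep gradient
    variance moderate during the unstable early phase."""
    if step >= warmup_steps:
        return full_len
    frac = step / warmup_steps
    length = start_len + frac * (full_len - start_len)
    return min(full_len, int(length) // multiple * multiple)
```

In a training loop, each batch would be truncated (or re-packed) to `seqlen_warmup(step, warmup_steps)` tokens before the forward pass.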

Length-Adaptive Transformer: Train Once with Length Drop, Use Anytime with Search

This paper extends PoWER-BERT and proposes Length-Adaptive Transformer, a transformer that can be used for various inference scenarios after one-shot training and demonstrates the superior accuracy-efficiency trade-off under various setups, including span-based question answering and text classification.

Competence-based Curriculum Learning for Neural Machine Translation

A curriculum learning framework for NMT that reduces training time, reduces the need for specialized heuristics or large batch sizes, and results in overall better performance, for both recurrent neural network models and Transformers.
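The framework paces training with a competence function: at step t the model may only sample from the easiest competence(t) fraction of the data, ranked by a difficulty score such as sentence length or word rarity. Below is a sketch of the square-root competence schedule from that line of work; the parameter names are assumptions.

```python
import math

def competence(step, total_steps, c0=0.01):
    """Square-root competence schedule: returns the fraction of the
    difficulty-sorted training set the model is allowed to sample
    from at this step, growing from c0 at step 0 to 1.0 at
    total_steps and staying at 1.0 afterwards."""
    return min(1.0, math.sqrt(step * (1 - c0 ** 2) / total_steps + c0 ** 2))
```

A data loader would then draw examples uniformly from the easiest `competence(step, total_steps)` quantile of the difficulty distribution.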

Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

This work introduces a Sparsely-Gated Mixture-of-Experts layer (MoE), consisting of up to thousands of feed-forward sub-networks, and applies the MoE to the tasks of language modeling and machine translation, where model capacity is critical for absorbing the vast quantities of knowledge available in the training corpora.
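The gating mechanism at the heart of this layer can be sketched in a few lines of NumPy: each token is routed to its k highest-scoring experts, with softmax weights renormalized over just the selected experts. This is an assumption-laden simplification that omits the noise term and load-balancing loss of the full design.

```python
import numpy as np

def top_k_gating(x, w_gate, k=2):
    """Top-k gating sketch for a Sparsely-Gated MoE layer: compute
    gate logits per expert, keep only the k largest per token, and
    softmax over those so each token's gate weights sum to 1."""
    logits = x @ w_gate                          # (tokens, num_experts)
    topk = np.argsort(logits, axis=-1)[:, -k:]   # indices of the k best experts
    gates = np.zeros_like(logits)
    for i, idx in enumerate(topk):
        e = np.exp(logits[i, idx] - logits[i, idx].max())  # stable softmax
        gates[i, idx] = e / e.sum()
    return gates, topk

# Toy example: 2 tokens, hidden dim 2, 3 experts.
x = np.array([[1.0, 0.0], [0.0, 1.0]])
w_gate = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
gates, topk = top_k_gating(x, w_gate, k=2)
```

Each token's output is then the gate-weighted sum of its selected experts' feed-forward outputs, so only k of the (possibly thousands of) experts run per token.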

PoWER-BERT: Accelerating BERT Inference via Progressive Word-vector Elimination

This work develops a novel method, called PoWER-BERT, for improving the inference time of the popular BERT model while maintaining accuracy, and shows that it offers a significantly better trade-off between accuracy and inference time compared to prior methods.