Corpus ID: 220935910

Multi-node Bert-pretraining: Cost-efficient Approach

@article{Lin2020MultinodeBC,
  title={Multi-node Bert-pretraining: Cost-efficient Approach},
  author={Jiahuang Lin and X. Li and Gennady Pekhimenko},
  journal={ArXiv},
  year={2020},
  volume={abs/2008.00177}
}
Recently, large-scale Transformer-based language models such as BERT, GPT-2, and XLNet have brought about exciting leaps in state-of-the-art results for many Natural Language Processing (NLP) tasks. One of the common trends in these recent models is a significant increase in model complexity, which introduces both more weights and computation. Moreover, with the advent of large-scale unsupervised datasets, training time is further extended due to the increased number of data samples within a… 

SWARM Parallelism: Training Large Models Can Be Surprisingly Communication-Efficient

  • Computer Science, 2021
TLDR
This work proposes SWARM Parallelism, a model-parallel training algorithm designed for swarms of poorly connected, heterogeneous, and unreliable devices; it creates temporary randomized pipelines between available nodes and rebalances them in case of failure.
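
A minimal sketch can make the randomized-pipeline idea concrete (this is not the authors' implementation; the node dictionaries and helper names below are hypothetical): each pipeline stage is served by a pool of interchangeable nodes, a temporary pipeline samples one node per stage, and pools are rebalanced when a node fails.

import random

def assign_pipeline(alive_nodes, num_stages):
    # Form a temporary randomized pipeline: pick one node per stage from
    # the pool of nodes currently serving that stage.
    pools = {s: [n for n in alive_nodes if n["stage"] == s]
             for s in range(num_stages)}
    return [random.choice(pools[s]) for s in range(num_stages)]

def rebalance_on_failure(alive_nodes, failed_node, num_stages):
    # Drop the failed node; if one stage's pool is now much smaller than
    # another's, move a node from the largest pool to the depleted stage
    # so throughput stays balanced (a toy version of the rebalancing idea).
    alive_nodes = [n for n in alive_nodes if n["id"] != failed_node["id"]]
    sizes = {s: sum(n["stage"] == s for n in alive_nodes)
             for s in range(num_stages)}
    needy, donor = min(sizes, key=sizes.get), max(sizes, key=sizes.get)
    if sizes[donor] - sizes[needy] > 1:
        for n in alive_nodes:
            if n["stage"] == donor:
                n["stage"] = needy
                break
    return alive_nodes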

Distributed Deep Learning in Open Collaborations

TLDR
This work carefully analyzes the constraints of collaborative training and proposes a novel algorithmic framework designed specifically for it; applied to SwAV and ALBERT pretraining under realistic conditions, it achieves performance comparable to traditional setups at a fraction of the cost.

Workload characterization of a time-series prediction system for spatio-temporal data

TLDR
A proxy application is developed for deep-learning-based time-series prediction that uses spatio-temporal data from a dynamical system for model training and inference; the computational profiles of TensorFlow and PyTorch are found to exhibit mostly divergent overheads across GPU platforms.

Moshpit SGD: Communication-Efficient Decentralized Training on Heterogeneous Unreliable Devices

TLDR
This work proposes Moshpit All-Reduce, an iterative averaging protocol that converges exponentially to the global average, and demonstrates the efficiency of this protocol for distributed optimization with strong theoretical guarantees.
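
The averaging idea can be illustrated with a toy sketch (assumptions: random equal-size groups instead of the paper's structured grouping, and plain NumPy in place of a real distributed runtime): repeatedly averaging within small random groups drives every peer's value toward the global average without any all-to-all communication.

import numpy as np

def moshpit_round(values, group_size):
    # One round of randomized group averaging: peers are shuffled into
    # groups of `group_size`, and each group replaces its members' values
    # with the group mean. Equal-size groups preserve the global mean.
    idx = np.random.permutation(len(values))
    for start in range(0, len(values), group_size):
        group = idx[start:start + group_size]
        values[group] = values[group].mean()
    return values

values = np.random.randn(64)
target = values.mean()
for _ in range(5):
    values = moshpit_round(values, group_size=8)
print(abs(values - target).max())  # deviation from the global average shrinks rapidly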

The Design Process for Google's Training Chips: TPUv2 and TPUv3

TLDR
This paper details the circumstances that led to this outcome, the challenges and opportunities observed, the approach taken for the chips, a quick review of performance, and a retrospective on the results.

References

Showing 1-10 of 33 references

Reducing BERT Pre-Training Time from 3 Days to 76 Minutes

TLDR
The LAMB optimizer is proposed, which helps scale the batch size to 65,536 without losing accuracy; it is a general optimizer that works for both small and large batch sizes and requires no hyper-parameter tuning besides the learning rate.
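
A simplified per-layer LAMB step is sketched below (an illustrative NumPy sketch, not the authors' reference implementation; Adam's bias correction is omitted for brevity). The key idea is the layer-wise trust ratio that rescales an Adam-style update, which is what keeps per-layer step sizes well behaved at very large batch sizes.

import numpy as np

def lamb_layer_update(w, grad, m, v, lr, beta1=0.9, beta2=0.999,
                      eps=1e-6, weight_decay=0.01):
    # Adam-style moment updates for this layer's weights.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    update = m / (np.sqrt(v) + eps) + weight_decay * w
    # Layer-wise trust ratio: ||w|| / ||update|| rescales the step.
    w_norm, u_norm = np.linalg.norm(w), np.linalg.norm(update)
    trust_ratio = w_norm / u_norm if w_norm > 0 and u_norm > 0 else 1.0
    return w - lr * trust_ratio * update, m, v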

GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism

TLDR
GPipe is introduced, a pipeline parallelism library that allows scaling any network that can be expressed as a sequence of layers by pipelining different sub-sequences of layers on separate accelerators, resulting in almost linear speedup when a model is partitioned across multiple accelerators.
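
The scheduling idea behind pipeline parallelism can be sketched without any accelerators (an illustrative toy, not GPipe's API): a mini-batch is split into micro-batches, and at clock tick t stage s works on micro-batch t - s, so stages overlap on different micro-batches instead of idling while one large batch moves through.

def gpipe_schedule(num_stages, num_microbatches):
    # Forward-pass schedule: at each clock tick, list the (stage, micro-batch)
    # pairs that run in parallel. Stage s handles micro-batch t - s at tick t.
    schedule = []
    for t in range(num_stages + num_microbatches - 1):
        ticks = []
        for s in range(num_stages):
            mb = t - s
            if 0 <= mb < num_microbatches:
                ticks.append((s, mb))
        schedule.append(ticks)
    return schedule

# Example: 4 stages, 4 micro-batches -> 7 ticks instead of 16 sequential steps,
# i.e. close to linear speedup once the pipeline is full.
for t, ticks in enumerate(gpipe_schedule(4, 4)):
    print(f"tick {t}: {ticks}")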

Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour

TLDR
This paper empirically shows that on the ImageNet dataset large minibatches cause optimization difficulties, but that when these are addressed the trained networks exhibit good generalization, enabling training of visual recognition models on internet-scale data with high efficiency.
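
The paper's central recipe, the linear scaling rule with gradual warmup, is easy to sketch (the warmup length below is an illustrative assumption, not the paper's exact schedule): scale the learning rate proportionally to the batch size, and ramp it up linearly from the base rate over the first few epochs.

def scaled_lr(base_lr, base_batch, batch, step, warmup_steps=6250):
    # Linear scaling rule: target learning rate grows with batch size.
    target = base_lr * batch / base_batch
    # Gradual warmup: linearly interpolate from base_lr to the target.
    if step < warmup_steps:
        return base_lr + (target - base_lr) * step / warmup_steps
    return target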

Measuring the Effects of Data Parallelism on Neural Network Training

TLDR
This work experimentally characterizes the effect of increasing the batch size on training time, measured as the number of steps necessary to reach a goal out-of-sample error, studies how this relationship varies with the training algorithm, model, and data set, and finds extremely large variation between workloads.

Echo: Compiler-based GPU Memory Footprint Reduction for LSTM RNN Training

TLDR
Echo is a new compiler-based optimization scheme that addresses the first challenge with a practical mechanism for estimating the memory benefits of recomputation over the entire computation graph, and the second by non-conservatively estimating the recomputation runtime overhead leveraging layer specifics.
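
Echo itself is a compiler-level mechanism, but the memory/compute trade-off it reasons about, recomputing activations instead of storing them, can be sketched with generic gradient checkpointing in PyTorch (an illustration of recomputation in general, not Echo's mechanism; the model and sizes below are arbitrary).

import torch
from torch.utils.checkpoint import checkpoint_sequential

# A stack of layers split into 4 checkpointed segments: only activations at
# segment boundaries are stored, and the rest are recomputed during backward,
# cutting activation memory at the cost of extra forward compute.
model = torch.nn.Sequential(*[torch.nn.Linear(1024, 1024) for _ in range(16)])
x = torch.randn(32, 1024, requires_grad=True)

y = checkpoint_sequential(model, 4, x)
y.sum().backward()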

Massive Exploration of Neural Machine Translation Architectures

TLDR
This work presents a large-scale analysis of the sensitivity of NMT architectures to common hyperparameters, and reports empirical results and variance numbers for several hundred experimental runs corresponding to over 250,000 GPU hours on a WMT English to German translation task.

RoBERTa: A Robustly Optimized BERT Pretraining Approach

TLDR
It is found that BERT was significantly undertrained and, when pretrained more carefully, can match or exceed the performance of every model published after it; the best model achieves state-of-the-art results on GLUE, RACE, and SQuAD.

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

TLDR
A new language representation model, BERT, is introduced; it pre-trains deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, and can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
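
The masked-language-model corruption that drives BERT's bidirectional pretraining can be sketched in a few lines (an illustrative sketch following the 80/10/10 replacement recipe described in the paper, not the authors' code; mask_id and vocab_size are assumed inputs).

import random

def mask_tokens(token_ids, mask_id, vocab_size, mask_prob=0.15):
    # Corrupt ~15% of positions and record (position, original token) pairs
    # that the model must predict from bidirectional context.
    inputs, labels = list(token_ids), []
    for i, tok in enumerate(token_ids):
        if random.random() < mask_prob:
            labels.append((i, tok))
            r = random.random()
            if r < 0.8:
                inputs[i] = mask_id                       # replace with [MASK]
            elif r < 0.9:
                inputs[i] = random.randrange(vocab_size)  # replace with a random token
            # else: keep the original token unchanged
    return inputs, labels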

Energy and Policy Considerations for Deep Learning in NLP

TLDR
This paper quantifies the approximate financial and environmental costs of training a variety of recently successful neural network models for NLP and proposes actionable recommendations to reduce costs and improve equity in NLP research and practice.

Attention is All you Need

TLDR
A new, simple network architecture, the Transformer, based solely on attention mechanisms and dispensing with recurrence and convolutions entirely, is proposed; it generalizes well to other tasks, applying successfully to English constituency parsing with both large and limited training data.
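
The Transformer's core operation, scaled dot-product attention, is compact enough to sketch directly (a minimal single-head NumPy version without masking, dropout, or the multi-head projections).

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # softmax(Q K^T / sqrt(d_k)) V, the building block of the Transformer.
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V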