Corpus ID: 238634311

LightSeq: Accelerated Training for Transformer-based Models on GPUs

@article{Wang2021LightSeqAT,
  title={LightSeq: Accelerated Training for Transformer-based Models on GPUs},
  author={Xiaohui Wang and Ying Xiong and Xian Qian and Yang Wei and Lei Li and Mingxuan Wang},
  journal={ArXiv},
  year={2021},
  volume={abs/2110.05722}
}
Transformer-based neural models are used in many AI applications. Training these models is expensive, as it takes huge GPU resources and long duration. It is challenging because typical data like sentences have variable lengths, and Transformer's computation patterns are more complex than those of convolutional neural networks. Existing systems either focus only on model inference or optimize only BERT-like encoder models. In this paper, we present LightSeq2, a system to accelerate training…

Benchmark Assessment for DeepSpeed Optimization Library

Evaluating the Microsoft DeepSpeed library on classification tasks across several modern neural network architectures, including convolutional neural networks (CNNs) and the Vision Transformer, indicated that while DeepSpeed can improve performance in some of these cases, it has no or even a negative impact in others.

Boosting Distributed Training Performance of the Unpadded BERT Model

A general structure for variable-length (unpadded) BERT models is proposed, and the overall performance of the BERT model is optimized through techniques such as kernel fusion and operator optimization.
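
The core idea of unpadding can be illustrated with a short sketch (the names below are illustrative, not the paper's API): a padded batch is flattened into one packed sequence containing only real tokens, plus cumulative sequence lengths that a variable-length attention kernel can consume.

import torch
import torch.nn.functional as F

def unpad_batch(input_ids, attention_mask):
    # Flatten a padded (batch, seq_len) batch into a single packed sequence of
    # real tokens, plus cumulative sequence lengths for a variable-length
    # attention kernel. Illustrative sketch only, not the paper's interface.
    seqlens = attention_mask.sum(dim=1, dtype=torch.int32)        # real tokens per sample
    indices = attention_mask.flatten().nonzero(as_tuple=True)[0]  # positions of real tokens
    packed_ids = input_ids.flatten()[indices]                     # (total_tokens,)
    cu_seqlens = F.pad(torch.cumsum(seqlens, dim=0), (1, 0))      # [0, len_0, len_0 + len_1, ...]
    return packed_ids, indices, cu_seqlens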

FastFold: Reducing AlphaFold Training Time from 11 Days to 67 Hours

FastFold is proposed, a highly efficient implementation of a protein structure prediction model for training and inference that includes a series of GPU optimizations based on a thorough analysis of AlphaFold's performance, and achieves high model parallelism scaling efficiency, surpassing existing popular model parallelism techniques.

HelixFold: An Efficient Implementation of AlphaFold2 using PaddlePaddle

HelixFold's accuracy could be on par with AlphaFold2 on the CASP14 and CAMEO datasets, and HelixFold saves 1x training time, which could accelerate the development of life science.

References

TurboTransformers: an efficient GPU serving system for transformer models

A transformer serving system called TurboTransformers is presented, which consists of a computing runtime and a serving framework, achieves state-of-the-art transformer model serving performance on GPU platforms, and can be seamlessly integrated into PyTorch code with a few lines of code.

LightSeq: A High Performance Inference Library for Transformers

A highly efficient inference library for models in the Transformer family is presented, which includes a series of GPU optimization techniques to both streamline the computation of Transformer layers and reduce memory footprint.

Learning Light-Weight Translation Models from Deep Transformer

A novel group-permutation based knowledge distillation approach is proposed to compress the deep Transformer model into a shallow model that is 8 times shallower than the deep model, with almost no loss in BLEU.
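
For background, knowledge distillation trains the shallow student to match the teacher's temperature-softened output distribution. The sketch below shows only the generic distillation objective, not the group-permutation grouping of teacher layers that the paper introduces.

import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Generic knowledge-distillation objective: KL divergence between
    # temperature-softened teacher and student distributions, mixed with the
    # usual cross-entropy on gold labels. The T*T factor keeps gradient
    # magnitudes comparable across temperatures.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard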

Train Large, Then Compress: Rethinking Model Size for Efficient Training and Inference of Transformers

It is shown that large models are more robust to compression techniques such as quantization and pruning than small models, and one can get the best of both worlds: heavily compressed, large models achieve higher accuracy than lightly compressed, small models.

Attention is All you Need

A new simple network architecture, the Transformer, based solely on attention mechanisms and dispensing with recurrence and convolutions entirely, is proposed; it generalizes well to other tasks, as shown by applying it successfully to English constituency parsing with both large and limited training data.
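
For reference, the scaled dot-product attention at the core of the Transformer is

\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V

where Q, K, and V are the query, key, and value matrices and d_k is the key dimension; the 1/\sqrt{d_k} scaling keeps the dot products in a range where the softmax retains useful gradients.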

Scaling Vision Transformers

A ViT model with two billion parameters is successfully trained, which attains a new state-of-the-art on ImageNet of 90.45% top-1 accuracy and performs well for few-shot transfer.

Accelerating Training of Transformer-Based Language Models with Progressive Layer Dropping

This work proposes a method based on progressive layer dropping that speeds the training of Transformer-based language models, not at the cost of excessive hardware resources but through efficiency gained from changes to the model architecture and training technique.
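
A minimal sketch of the underlying idea, skipping layers stochastically during training; the paper's actual schedule (the drop probability is annealed upward as training progresses) and its accompanying architectural changes are not reproduced here.

import torch
import torch.nn as nn

class LayerDropEncoder(nn.Module):
    # Minimal sketch of stochastic layer dropping: during training, each layer
    # is skipped with probability drop_prob, which the caller can increase over
    # the course of training. Illustrative only, not the paper's exact method.
    def __init__(self, layers):
        super().__init__()
        self.layers = nn.ModuleList(layers)

    def forward(self, x, drop_prob=0.0):
        for layer in self.layers:
            if self.training and torch.rand(()) < drop_prob:
                continue  # skip this layer for the current step
            x = layer(x)
        return x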

Conformer: Convolution-augmented Transformer for Speech Recognition

This work proposes the convolution-augmented transformer for speech recognition, named Conformer, which significantly outperforms previous Transformer- and CNN-based models, achieving state-of-the-art accuracies.

Random Feature Attention

RFA, a linear time and space attention mechanism that uses random feature methods to approximate the softmax function, is proposed and explored; it is competitive in terms of both accuracy and efficiency on three long text classification datasets.
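
A rough numerical sketch of the approximation: with l2-normalized queries and keys, exp(q·k) equals a constant (which cancels in the attention normalization) times a Gaussian kernel, and the Gaussian kernel can be estimated with random Fourier features, giving attention in time linear in the sequence length. The code below is illustrative, not the paper's implementation.

import numpy as np

def random_feature_map(x, W, sigma=1.0):
    # Random Fourier features for the Gaussian kernel (Rahimi & Recht):
    # phi(x) . phi(y) ~= exp(-||x - y||^2 / (2 * sigma^2)) for W ~ N(0, I).
    proj = x @ W.T / sigma
    D = W.shape[0]
    return np.sqrt(1.0 / D) * np.concatenate([np.sin(proj), np.cos(proj)], axis=-1)

def rfa_attention(Q, K, V, num_features=256, sigma=1.0, seed=0):
    # Linear-time attention sketch. Queries and keys are l2-normalized so the
    # exp(||x||^2 / (2 sigma^2)) factors become shared constants that cancel
    # between the numerator and the normalizer.
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((num_features, Q.shape[-1]))
    Qn = Q / np.linalg.norm(Q, axis=-1, keepdims=True)
    Kn = K / np.linalg.norm(K, axis=-1, keepdims=True)
    phi_q = random_feature_map(Qn, W, sigma)    # (n, 2D)
    phi_k = random_feature_map(Kn, W, sigma)    # (m, 2D)
    kv = phi_k.T @ V                            # (2D, d_v), built once for all queries
    z = phi_k.sum(axis=0)                       # (2D,)
    return (phi_q @ kv) / (phi_q @ z)[:, None]  # approximate softmax(Q K^T) V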

Hybrid 8-bit Floating Point (HFP8) Training and Inference for Deep Neural Networks

This work proposes a hybrid FP8 (HFP8) format and a DNN end-to-end distributed training procedure, and demonstrates, using HFP8, the successful training of deep learning models across a whole spectrum of applications, including image classification, object detection, language, and speech, without accuracy degradation.
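
A rough sketch of what simulating such a low-precision format looks like: round-to-nearest "fake quantization" onto an FP8 grid with a chosen exponent/mantissa split (for example 1-4-3 for forward tensors and 1-5-2 for gradients in hybrid schemes). Subnormal and overflow handling below is simplified; this illustrates the idea, not the paper's exact procedure or bit layouts.

import numpy as np

def fake_quant_fp8(x, exp_bits=4, man_bits=3):
    # Round-to-nearest simulation of an FP8 value grid with the given
    # exponent/mantissa split (use exp_bits=5, man_bits=2 for a 1-5-2 layout).
    # Simplified handling of subnormals and saturation; illustrative only.
    x = np.asarray(x, dtype=np.float64)
    bias = 2 ** (exp_bits - 1) - 1
    sign = np.sign(x)
    mag = np.abs(x)
    # Per-value exponent, clamped to the representable range; values below the
    # smallest normal number fall onto the subnormal grid of the minimum exponent.
    exp = np.floor(np.log2(np.where(mag > 0, mag, 1.0)))
    exp = np.clip(exp, 1 - bias, bias)
    step = 2.0 ** (exp - man_bits)               # grid spacing within each binade
    q = np.round(mag / step) * step
    max_val = (2.0 - 2.0 ** (-man_bits)) * 2.0 ** bias
    return sign * np.minimum(q, max_val)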