Corpus ID: 211532277

Train Large, Then Compress: Rethinking Model Size for Efficient Training and Inference of Transformers

@article{Li2020TrainLT,
  title={Train Large, Then Compress: Rethinking Model Size for Efficient Training and Inference of Transformers},
  author={Zhuohan Li and Eric Wallace and Sheng Shen and Kevin Lin and K. Keutzer and D. Klein and J. Gonzalez},
  journal={ArXiv},
  year={2020},
  volume={abs/2002.11794}
}
  • Zhuohan Li, Eric Wallace, Sheng Shen, Kevin Lin, K. Keutzer, D. Klein, J. Gonzalez
  • Published 2020
  • Computer Science
  • ArXiv
  • Since hardware resources are limited, the objective of training deep learning models is typically to maximize accuracy subject to the time and memory constraints of training and inference. We study the impact of model size in this setting, focusing on Transformer models for NLP tasks that are limited by compute: self-supervised pretraining and high-resource machine translation. We first show that even though smaller Transformer models execute faster per iteration, wider and deeper models converge in significantly fewer gradient steps, and this faster convergence typically outpaces the extra computational cost per step. The most compute-efficient training strategy is therefore to train very large models and stop training early. Although this seems to trade training efficiency for inference efficiency, large models are also more robust to compression techniques such as quantization and pruning, so heavily compressed large models end up more accurate than lightly compressed small models.
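
To make the "then compress" step concrete, the sketch below applies the two compression techniques named in the abstract, magnitude pruning and post-training int8 quantization, to a stand-in Transformer encoder using PyTorch's built-in utilities (torch.nn.utils.prune and torch.quantization.quantize_dynamic). It is an illustrative sketch, not the authors' implementation; the encoder dimensions and the 60% sparsity level are arbitrary assumptions.

import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# Stand-in for a trained "large" encoder; dimensions are illustrative,
# not the configurations evaluated in the paper.
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=512, nhead=8, dim_feedforward=2048),
    num_layers=6,
)
encoder.eval()

# 1) Magnitude pruning: zero the smallest 60% of each Linear layer's weights
#    (the 60% sparsity level is an example value, not taken from the paper).
for module in encoder.modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.6)
        prune.remove(module, "weight")  # bake the zeros into the weight tensor

# 2) Post-training dynamic quantization: store Linear weights as int8.
compressed = torch.quantization.quantize_dynamic(
    encoder, {nn.Linear}, dtype=torch.qint8
)

# The compressed encoder is a drop-in replacement at inference time.
tokens = torch.randn(35, 8, 512)  # (sequence length, batch, d_model)
with torch.no_grad():
    output = compressed(tokens)
print(output.shape)  # torch.Size([35, 8, 512])

Calling prune.remove after l1_unstructured makes the sparsity permanent, so the compressed encoder can be saved and served without any pruning hooks attached.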
    37 Citations
    • Length-Adaptive Transformer: Train Once with Length Drop, Use Anytime with Search
    • EarlyBERT: Efficient BERT Training via Early-bird Lottery Tickets (4 citations)
    • Language Models are Few-Shot Learners (693 citations)
    • Structured Pruning of a BERT-based Question Answering Model (18 citations)
    • Principal Component Networks: Parameter Reduction Early in Training
    • Weight Distillation: Transferring the Knowledge in Neural Network Parameters
    • Pretrained Transformers Improve Out-of-Distribution Robustness (32 citations)

    References

    Showing 1-10 of 78 references
    • DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter (608 citations)
    • Efficient Training of BERT by Progressively Stacking (23 citations)
    • Compression of Neural Machine Translation Models via Pruning (137 citations)
    • To prune, or not to prune: exploring the efficacy of pruning for model compression (312 citations)
    • Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour (1,464 citations, highly influential)
    • Reformer: The Efficient Transformer (212 citations)
    • An Empirical Model of Large-Batch Training (86 citations)
    • ALBERT: A Lite BERT for Self-supervised Learning of Language Representations (992 citations, highly influential)
    • Budgeted Training: Rethinking Deep Neural Network Training Under Resource Constraints (9 citations, highly influential)
    • Patient Knowledge Distillation for BERT Model Compression (142 citations)