Corpus ID: 215416222

Poor Man's BERT: Smaller and Faster Transformer Models

@article{Sajjad2020PoorMB,
  title={Poor Man's BERT: Smaller and Faster Transformer Models},
  author={Hassan Sajjad and Fahim Dalvi and Nadir Durrani and Preslav Nakov},
  journal={ArXiv},
  year={2020},
  volume={abs/2004.03844}
}
The ongoing neural revolution in Natural Language Processing has recently been dominated by large-scale pre-trained Transformer models, where size does matter: it has been shown that the number of parameters in such a model is typically positively correlated with its performance. Naturally, this situation has unleashed a race for ever larger models, many of which, including the large versions of popular models such as BERT, XLNet, and RoBERTa, are now out of reach for researchers and… 
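The paper explores strategies for dropping encoder layers from a pre-trained model before task-specific fine-tuning. A minimal sketch of one such strategy, top-layer dropping, assuming the Hugging Face transformers BertModel layout (the checkpoint name and the number of dropped layers are illustrative choices, not the paper's exact configuration):

```python
import torch
from transformers import BertModel, BertTokenizer

# Load a pre-trained 12-layer BERT encoder.
model = BertModel.from_pretrained("bert-base-uncased")
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Top-layer dropping: keep only the bottom 6 encoder layers (illustrative choice).
keep = 6
model.encoder.layer = model.encoder.layer[:keep]
model.config.num_hidden_layers = keep

# The truncated encoder still runs; it would normally be fine-tuned on the target task.
inputs = tokenizer("Dropping layers before fine-tuning.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, seq_len, 768)
```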

Citations

Greedy-layer pruning: Speeding up transformer models for natural language processing
Optimizing Transformers with Approximate Computing for Faster, Smaller and more Accurate NLP Models
TLDR
This work applies approximate computing to Transformers in NLP tasks, proposing a framework that creates smaller, faster, and in some cases more accurate models, depending on the user's constraints.
Block Pruning For Faster Transformers
TLDR
This approach extends structured pruning methods by considering blocks of any size, integrates this structure into the movement pruning paradigm for fine-tuning, and is found to learn to prune out full components of the underlying model, such as attention heads.
A Practical Survey on Faster and Lighter Transformers
TLDR
This survey investigates popular approaches to make the Transformer faster and lighter and provides a comprehensive explanation of the methods' strengths, limitations, and underlying assumptions to meet the desired trade-off between capacity, computation, and memory.
Teacher-student knowledge distillation from BERT
TLDR
This work distils BERT into two architecturally diverse students on diverse NLP tasks and subsequently analyses what the students learnt, demonstrating a novel use of probing for tracing such knowledge back to its origins.
On the Compression of Natural Language Models
TLDR
It has been shown that typical dense neural networks contain a small sparse sub-network that can be trained to reach similar test accuracy in an equal number of steps; the goal of this work is to assess whether such a trainable sub-network exists for natural language models (NLMs).
AMMUS : A Survey of Transformer-based Pretrained Models in Natural Language Processing
TLDR
This comprehensive survey paper explains various core concepts like pretraining, pretraining methods, pretraining tasks, embeddings, and downstream adaptation methods, presents a new taxonomy of T-PTLMs, and gives a brief overview of various benchmarks, both intrinsic and extrinsic.
The Optimal BERT Surgeon: Scalable and Accurate Second-Order Pruning for Large Language Models
TLDR
Optimal BERT Surgeon (oBERT) is introduced, an efficient and accurate weight pruning method based on approximate second-order information, and its use in compounding compression approaches for Transformer-based models is investigated, enabling state-of-the-art structured and unstructured pruning together with quantization.
Pruning Neural Machine Translation for Speed Using Group Lasso
TLDR
Group lasso regularisation enables pruning entire rows, columns, or blocks of parameters, resulting in a smaller dense network that pushes the Pareto frontier of the time/quality trade-off compared to strong baselines.
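As a hedged illustration of the group-lasso idea above (not the paper's NMT setup), one can add a penalty that sums the L2 norms of whole rows of a weight matrix, so entire rows are driven toward zero together and can be pruned as a group; the regularisation strength below is illustrative:

```python
import torch
import torch.nn as nn

def group_lasso_penalty(weight: torch.Tensor, dim: int = 1) -> torch.Tensor:
    """Sum of L2 norms over parameter groups (here: rows), encouraging
    whole rows to shrink to zero so they can be removed as a group."""
    return weight.norm(p=2, dim=dim).sum()

layer = nn.Linear(512, 512)
task_loss = layer(torch.randn(8, 512)).pow(2).mean()  # stand-in for a real training loss
lam = 1e-4  # illustrative regularisation strength
loss = task_loss + lam * group_lasso_penalty(layer.weight)
loss.backward()
```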

References

SHOWING 1-10 OF 39 REFERENCES
ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
TLDR
This work presents two parameter-reduction techniques to lower memory consumption and increase the training speed of BERT, and uses a self-supervised loss that focuses on modeling inter-sentence coherence.
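A toy sketch of the two parameter-reduction ideas mentioned above, factorised embedding parameterisation and cross-layer parameter sharing, written in plain PyTorch (dimensions are illustrative, not ALBERT's actual configuration):

```python
import torch
import torch.nn as nn

class TinySharedEncoder(nn.Module):
    def __init__(self, vocab=30000, emb_dim=128, hidden=768, layers=12, heads=12):
        super().__init__()
        # Factorised embedding: vocab -> small E, then project E -> hidden H.
        self.tok_emb = nn.Embedding(vocab, emb_dim)
        self.emb_proj = nn.Linear(emb_dim, hidden)
        # Cross-layer parameter sharing: one layer's weights reused at every depth.
        self.shared_layer = nn.TransformerEncoderLayer(
            d_model=hidden, nhead=heads, batch_first=True
        )
        self.layers = layers

    def forward(self, token_ids):
        x = self.emb_proj(self.tok_emb(token_ids))
        for _ in range(self.layers):  # same parameters applied at every depth
            x = self.shared_layer(x)
        return x

enc = TinySharedEncoder()
out = enc(torch.randint(0, 30000, (2, 16)))
print(out.shape)  # (2, 16, 768)
```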
DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter
TLDR
This work proposes a method to pre-train a smaller general-purpose language representation model, called DistilBERT, which can be fine-tuned with good performance on a wide range of tasks, like its larger counterparts, and introduces a triple loss combining language modeling, distillation, and cosine-distance losses.
Q8BERT: Quantized 8Bit BERT
TLDR
This work shows how to perform quantization-aware training during the fine-tuning phase of BERT in order to compress it by 4x with minimal accuracy loss; the resulting quantized model can also accelerate inference when run on hardware with 8-bit integer support.
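Q8BERT itself relies on quantization-aware training during fine-tuning; as a simpler, hedged stand-in, PyTorch's post-training dynamic quantization illustrates the 8-bit-weight idea (the base checkpoint stands in for a fine-tuned model):

```python
import torch
from transformers import BertForSequenceClassification

# A fine-tuned model would normally be loaded here; the base checkpoint is illustrative.
model = BertForSequenceClassification.from_pretrained("bert-base-uncased")
model.eval()

# Post-training dynamic quantization: Linear weights stored as int8,
# activations quantized on the fly at inference time.
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# Linear projections are now dynamically quantized modules
# (exact class path varies by PyTorch version).
print(type(quantized.bert.encoder.layer[0].attention.self.query))
```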
Reducing Transformer Depth on Demand with Structured Dropout
TLDR
LayerDrop, a form of structured dropout, is explored, which has a regularization effect during training and allows for efficient pruning at inference time, and shows that it is possible to select sub-networks of any depth from one large network without having to finetune them and with limited impact on performance.
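A hedged sketch of the LayerDrop idea (not the authors' implementation): each layer is skipped with some probability during training, and at inference a fixed subset of layers can be kept without retraining:

```python
import torch
import torch.nn as nn

class LayerDropEncoder(nn.Module):
    def __init__(self, num_layers=12, d_model=256, nhead=4, p_drop=0.2):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
            for _ in range(num_layers)
        )
        self.p_drop = p_drop  # per-layer drop probability (illustrative value)

    def forward(self, x, keep_every=1):
        for i, layer in enumerate(self.layers):
            if self.training:
                # Structured dropout: randomly skip whole layers while training.
                if torch.rand(1).item() < self.p_drop:
                    continue
            else:
                # At inference, prune layers on demand, e.g. keep every 2nd layer.
                if i % keep_every != 0:
                    continue
            x = layer(x)
        return x

enc = LayerDropEncoder()
enc.eval()
out = enc(torch.randn(2, 10, 256), keep_every=2)  # uses only 6 of 12 layers
```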
Are Sixteen Heads Really Better than One?
TLDR
It is made the surprising observation that even if models have been trained using multiple heads, in practice, a large percentage of attention heads can be removed at test time without significantly impacting performance.
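Removing attention heads from a trained model, as that observation suggests is possible, can be sketched with the transformers prune_heads API; the layer and head indices below are arbitrary, not the paper's selection:

```python
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")

# Remove selected heads per layer: {layer_index: [head_indices]}.
model.prune_heads({0: [0, 1, 2], 5: [4], 11: [7, 8]})

# The pruned attention layers now have fewer heads and smaller projections.
print(model.encoder.layer[0].attention.self.num_attention_heads)  # 9 instead of 12
```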
Distilling Task-Specific Knowledge from BERT into Simple Neural Networks
TLDR
This paper proposes to distill knowledge from BERT, a state-of-the-art language representation model, into a single-layer BiLSTM, as well as its siamese counterpart for sentence-pair tasks, and achieves comparable results with ELMo.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
TLDR
A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
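The "just one additional output layer" point can be seen directly in the transformers library, where a classification head is a single linear layer on top of the pre-trained encoder (a minimal sketch, with an illustrative two-label task):

```python
from transformers import BertForSequenceClassification

# Fine-tuning adds a single linear classification head on top of the
# pre-trained bidirectional encoder; the encoder weights are reused as-is.
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
print(model.classifier)  # Linear(in_features=768, out_features=2, bias=True)
```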
Compressing BERT: Studying the Effects of Weight Pruning on Transfer Learning
TLDR
It is concluded that BERT can be pruned once during pre-training rather than separately for each task without affecting performance, and that fine-tuning BERT on a specific task does not improve its prunability.
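A hedged sketch of simple magnitude (L1, unstructured) weight pruning of BERT's linear layers with torch.nn.utils.prune; the 30% sparsity level is illustrative, and the paper's pre-training-time pruning schedule is not reproduced here:

```python
import torch
import torch.nn.utils.prune as prune
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")

# Zero out the 30% smallest-magnitude weights in every Linear layer.
for module in model.modules():
    if isinstance(module, torch.nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.3)
        prune.remove(module, "weight")  # make the pruning permanent

total = sum(p.numel() for p in model.parameters())
zeros = sum((p == 0).sum().item() for p in model.parameters())
print(f"sparsity: {zeros / total:.1%}")
```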
RoBERTa: A Robustly Optimized BERT Pretraining Approach
TLDR
It is found that BERT was significantly undertrained, and can match or exceed the performance of every model published after it, and the best model achieves state-of-the-art results on GLUE, RACE and SQuAD.
Distilling the Knowledge in a Neural Network
TLDR
This work shows that it can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model and introduces a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse.
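The soft-target distillation loss described above can be sketched as follows; the temperature and mixing weight are illustrative hyperparameters:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend KL divergence to temperature-softened teacher outputs with the
    usual cross-entropy on the hard labels (soft-target distillation)."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # rescale so soft and hard gradients have comparable magnitude
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

student = torch.randn(4, 10, requires_grad=True)
teacher = torch.randn(4, 10)
labels = torch.randint(0, 10, (4,))
distillation_loss(student, teacher, labels).backward()
```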