Block Pruning For Faster Transformers
@article{Lagunas2021BlockPF,
  title   = {Block Pruning For Faster Transformers},
  author  = {François Lagunas and Ella Charlaix and Victor Sanh and Alexander M. Rush},
  journal = {ArXiv},
  year    = {2021},
  volume  = {abs/2109.04838}
}
Pre-training has improved model accuracy for both classification and generation tasks at the cost of introducing much larger and slower models. Pruning methods have proven to be an effective way of reducing model size, whereas distillation methods are proven for speeding up inference. We introduce a block pruning approach targeting both small and fast models. Our approach extends structured methods by considering blocks of any size and integrates this structure into the movement pruning…
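The abstract's core idea, pooling movement-pruning scores over blocks of a weight matrix and pruning whole blocks, can be illustrated with a rough sketch. This is a minimal illustration under assumptions (a single linear layer, a fixed 32x32 block size, a hard top-k selection, one gradient step standing in for scores accumulated over fine-tuning), not the authors' implementation.

```python
# Hedged sketch: block-level movement-pruning scores for one linear layer.
import torch

def block_scores(weight: torch.Tensor, grad: torch.Tensor,
                 block_rows: int, block_cols: int) -> torch.Tensor:
    """Pool per-weight movement scores (-W * dL/dW) over fixed-size blocks."""
    out_dim, in_dim = weight.shape
    per_weight = -(weight * grad)                # movement-pruning importance
    blocks = per_weight.reshape(out_dim // block_rows, block_rows,
                                in_dim // block_cols, block_cols)
    return blocks.sum(dim=(1, 3))                # one score per block

def block_mask(scores: torch.Tensor, keep_ratio: float) -> torch.Tensor:
    """Keep the highest-scoring blocks, zero out the rest."""
    k = max(1, int(keep_ratio * scores.numel()))
    threshold = scores.flatten().topk(k).values.min()
    return (scores >= threshold).float()

# Toy usage: one backward pass stands in for scores accumulated during fine-tuning.
W = torch.randn(768, 768, requires_grad=True)
loss = W.square().sum()
loss.backward()
scores = block_scores(W.detach(), W.grad, block_rows=32, block_cols=32)
mask = block_mask(scores, keep_ratio=0.5)
W_pruned = W.detach() * mask.repeat_interleave(32, dim=0).repeat_interleave(32, dim=1)
```

In the actual method the mask is relaxed and learned jointly with the weights during fine-tuning; the hard top-k cut here is only a stand-in for that selection.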
44 Citations
A Fast Post-Training Pruning Framework for Transformers
- Computer Science, ArXiv
- 2022
A fast post-training pruning framework for Transformers that prunes Transformers in less than 3 minutes on a single GPU, which is over two orders of magnitude faster than existing pruning approaches that retrain.
Gradient-based Intra-attention Pruning on Pre-trained Language Models
- Computer Science, ArXiv
- 2022
This work proposes GRAIN, which performs intra-attention pruning and allows different heads to have different sizes, and introduces structure regularization to encourage more regular pruned structures, which achieve higher speedups than heterogeneous ones.
Pruning Pretrained Encoders with a Multitask Objective
- Computer Science, ArXiv
- 2021
This work adopts recent strategies for model pruning during finetuning to explore the question of whether it is possible to prune a single encoder so that it can be used for multiple tasks.
PLATON: Pruning Large Transformer Models with Upper Confidence Bound of Weight Importance
- Computer Science, ICML
- 2022
PLATON is proposed, which captures the uncertainty of importance scores via an upper confidence bound (UCB) on importance estimation; for weights with low importance scores but high uncertainty, it tends to retain them and explore their capacity.
Prune Once for All: Sparse Pre-Trained Language Models
- Computer Science, ArXiv
- 2021
This work presents a new method for training sparse pre-trained Transformer language models by integrating weight pruning and model distillation, and shows the best compression-to-accuracy ratio for BERT-Base, BERT-Large, and DistilBERT.
LEAP: Learnable Pruning for Transformer-based Models
- Computer Science
- 2021
This work proposes LEAP (LEArnable Pruning), an effective method to gradually prune the model based on thresholds learned by gradient descent, and introduces a novel regularization function that directly interacts with the preset target pruning ratio.
SPDY: Accurate Pruning with Speedup Guarantees
- Computer Science, ICML
- 2022
SPDY is introduced, a new compression method that automatically determines layer-wise sparsity targets achieving a desired inference speedup on a given system while minimizing accuracy loss.
GMP*: Well-Tuned Global Magnitude Pruning Can Outperform Most BERT-Pruning Methods
- Computer Science, ArXiv
- 2022
This work revisits the performance of the classic gradual magnitude pruning baseline for large language models and shows that a simple and general variant, which is called GMP*, can match and sometimes outperform more complex state-of-the-art methods.
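Gradual magnitude pruning of the kind revisited above usually ramps sparsity up over fine-tuning with a cubic schedule. The sketch below shows such a schedule only; the step boundaries and target sparsity are illustrative assumptions, not the settings of GMP*.

```python
# Hedged sketch of a cubic sparsity schedule for gradual magnitude pruning.
def cubic_sparsity(step: int, start_step: int, end_step: int,
                   initial: float = 0.0, final: float = 0.9) -> float:
    """Current sparsity target: ramps from `initial` to `final` with a cubic curve."""
    if step <= start_step:
        return initial
    if step >= end_step:
        return final
    progress = (step - start_step) / (end_step - start_step)
    return final + (initial - final) * (1.0 - progress) ** 3

# Example: sparsity ramps from 0% to 90% between steps 1000 and 10000.
targets = [cubic_sparsity(s, 1000, 10000) for s in range(0, 12000, 2000)]
```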
EfficientVLM: Fast and Accurate Vision-Language Models via Knowledge Distillation and Modal-adaptive Pruning
- Computer Science, ArXiv
- 2022
A distill-then-prune framework is proposed to compress large vision-language models into smaller, faster, and more accurate ones, together with a modal-adaptive pruning algorithm that automatically infers the importance of the vision and language modalities for different downstream tasks and adaptively removes redundant structures and neurons in the different encoders with controllable target sparsity.
Exploring Extreme Parameter Compression for Pre-trained Language Models
- Computer Science, ICLR
- 2022
This work explores larger compression ratios for PLMs via tensor decomposition, a potential but under-investigated approach, and shows that the proposed method is orthogonal to existing compression methods such as knowledge distillation.
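As a generic illustration of decomposition-based parameter compression (not the specific tensor-decomposition scheme proposed above), a weight matrix can be replaced by two thin factors from a truncated SVD; the shapes and rank below are arbitrary assumptions.

```python
# Hedged sketch: low-rank factorization of a weight matrix via truncated SVD.
import torch

def low_rank_factors(weight: torch.Tensor, rank: int):
    u, s, vh = torch.linalg.svd(weight, full_matrices=False)
    a = u[:, :rank] * s[:rank]          # (out, rank), columns scaled by singular values
    b = vh[:rank, :]                    # (rank, in)
    return a, b                         # W ~= a @ b, storing (out + in) * rank parameters

W = torch.randn(768, 3072)
A, B = low_rank_factors(W, rank=64)
approx_error = (W - A @ B).norm() / W.norm()
```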
References
Showing 1-10 of 43 references
Poor Man's BERT: Smaller and Faster Transformer Models
- Computer Science, ArXiv
- 2020
A number of memory-light model reduction strategies that do not require model pre-training from scratch are explored, which are able to prune BERT, RoBERTa and XLNet models by up to 40%, while maintaining up to 98% of their original performance.
Train Large, Then Compress: Rethinking Model Size for Efficient Training and Inference of Transformers
- Computer Science, ICML
- 2020
It is shown that large models are more robust to compression techniques such as quantization and pruning than small models, and one can get the best of both worlds: heavily compressed, large models achieve higher accuracy than lightly compressed, small models.
Movement Pruning: Adaptive Sparsity by Fine-Tuning
- Computer Science, NeurIPS
- 2020
Experiments show that when pruning large pretrained language models, movement pruning shows significant improvements in high-sparsity regimes and when combined with distillation, the approach achieves minimal accuracy loss with down to only 3% of the model parameters.
Compressing BERT: Studying the Effects of Weight Pruning on Transfer Learning
- Computer Science, REPL4NLP
- 2020
It is concluded that BERT can be pruned once during pre-training rather than separately for each task without affecting performance, and that fine-tuning BERT on a specific task does not improve its prunability.
Structured Pruning of a BERT-based Question Answering Model
- Computer Science
- 2019
This paper investigates compressing BERT- and RoBERTa-based question answering systems by structured pruning of parameters from the underlying trained transformer model, and finds that an inexpensive combination of task-specific structured pruning and task-specific distillation yields highly performing models across a range of speed/accuracy trade-off operating points.
Reducing Transformer Depth on Demand with Structured Dropout
- Computer Science, ICLR
- 2020
LayerDrop, a form of structured dropout, is explored, which has a regularization effect during training and allows for efficient pruning at inference time, and shows that it is possible to select sub-networks of any depth from one large network without having to finetune them and with limited impact on performance.
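A minimal sketch of the structured-dropout idea above, assuming a generic stack of layers (the layer internals here are stand-ins): whole layers are randomly skipped during training, and a shallower sub-network is simply selected at inference.

```python
# Hedged sketch of LayerDrop-style structured dropout over a layer stack.
import torch
import torch.nn as nn

class LayerDropStack(nn.Module):
    def __init__(self, layers: nn.ModuleList, drop_prob: float = 0.2):
        super().__init__()
        self.layers, self.drop_prob = layers, drop_prob

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for layer in self.layers:
            if self.training and torch.rand(1).item() < self.drop_prob:
                continue  # skip the whole layer (identity, as with residual connections)
            x = layer(x)
        return x

# Usage: train with random layer skipping, then keep every other layer at inference.
stack = LayerDropStack(nn.ModuleList(nn.Linear(64, 64) for _ in range(12)))
pruned = LayerDropStack(nn.ModuleList(list(stack.layers)[::2]), drop_prob=0.0)
out = pruned(torch.randn(4, 64))
```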
DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter
- Computer Science, ArXiv
- 2019
This work proposes a method to pre-train a smaller general-purpose language representation model, called DistilBERT, which can be fine-tuned with good performances on a wide range of tasks like its larger counterparts, and introduces a triple loss combining language modeling, distillation and cosine-distance losses.
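A rough sketch of a triple objective in the spirit of the one described above (language modeling, soft-target distillation, and cosine alignment of hidden states); the temperature, equal loss weights, and matching student/teacher hidden sizes are assumptions, not the paper's exact settings.

```python
# Hedged sketch of a DistilBERT-style triple training objective.
import torch
import torch.nn.functional as F

def distillation_objective(student_logits, teacher_logits,
                           student_hidden, teacher_hidden,
                           labels, temperature: float = 2.0):
    # Hard-label masked language modeling loss.
    mlm = F.cross_entropy(student_logits.view(-1, student_logits.size(-1)),
                          labels.view(-1), ignore_index=-100)
    # Soft-target distillation: KL between temperature-scaled distributions.
    kd = F.kl_div(F.log_softmax(student_logits / temperature, dim=-1),
                  F.softmax(teacher_logits / temperature, dim=-1),
                  reduction="batchmean") * temperature ** 2
    # Cosine alignment between student and teacher hidden states (same size assumed).
    cos = 1.0 - F.cosine_similarity(student_hidden, teacher_hidden, dim=-1).mean()
    return mlm + kd + cos

# Toy shapes: batch of 2, sequence of 8, vocab 1000, hidden 768.
s_logits, t_logits = torch.randn(2, 8, 1000), torch.randn(2, 8, 1000)
s_hid, t_hid = torch.randn(2, 8, 768), torch.randn(2, 8, 768)
labels = torch.randint(0, 1000, (2, 8))
loss = distillation_objective(s_logits, t_logits, s_hid, t_hid, labels)
```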
Are Sixteen Heads Really Better than One?
- Computer Science, NeurIPS
- 2019
The surprising observation is made that even if models have been trained using multiple heads, in practice a large percentage of attention heads can be removed at test time without significantly impacting performance.
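To make the head-removal experiment above concrete, here is a minimal self-attention module with per-head gates that can be zeroed at test time; the module layout and the gating mechanism are illustrative assumptions, not the paper's code.

```python
# Hedged sketch of test-time attention-head masking via per-head gates.
import torch
import torch.nn as nn

class GatedMultiHeadSelfAttention(nn.Module):
    def __init__(self, d_model: int = 768, n_heads: int = 12):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)
        # One gate per head; set entries to 0.0 to remove heads at test time.
        self.head_gates = nn.Parameter(torch.ones(n_heads), requires_grad=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        def split(z):  # (b, t, d) -> (b, heads, t, d_head)
            return z.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        q, k, v = split(q), split(k), split(v)
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)
        heads = attn @ v                                   # (b, heads, t, d_head)
        heads = heads * self.head_gates.view(1, -1, 1, 1)  # zero out pruned heads
        return self.out(heads.transpose(1, 2).reshape(b, t, -1))

# Usage: prune heads 3 and 7, then run a forward pass.
layer = GatedMultiHeadSelfAttention()
layer.head_gates[3] = 0.0
layer.head_gates[7] = 0.0
y = layer(torch.randn(2, 16, 768))
```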
FastFormers: Highly Efficient Transformer Models for Natural Language Understanding
- Computer Science, SUSTAINLP
- 2020
This paper presents FastFormers, a set of recipes to achieve efficient inference-time performance for Transformer-based models on various NLU tasks and pretrained models, and shows how carefully utilizing knowledge distillation, structured pruning and numerical optimization can lead to drastic improvements on inference efficiency.
Compression of Neural Machine Translation Models via Pruning
- Computer Science, CoNLL
- 2016
It is shown that an NMT model with over 200 million parameters can be pruned by 40% with very little performance loss as measured on the WMT'14 English-German translation task.
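Global (class-blind) magnitude pruning of the kind evaluated above can be sketched in a few lines: rank all weights by absolute value and zero out the smallest ones across the whole model. The stand-in model below and the 40% sparsity mirror the setting reported, but this is only an illustrative sketch.

```python
# Hedged sketch: global magnitude pruning to a target sparsity.
import torch
import torch.nn as nn

def global_magnitude_prune(model: nn.Module, sparsity: float) -> None:
    """Zero out the smallest-magnitude weights across all weight matrices."""
    weights = [p for p in model.parameters() if p.dim() > 1]
    all_mags = torch.cat([w.detach().abs().flatten() for w in weights])
    threshold = torch.quantile(all_mags, sparsity)   # global cut-off magnitude
    with torch.no_grad():
        for w in weights:
            w.mul_((w.abs() >= threshold).float())

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 512))
global_magnitude_prune(model, sparsity=0.4)
```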