Block Pruning For Faster Transformers

François Lagunas, Ella Charlaix, Victor Sanh, Alexander M. Rush
Pre-training has improved model accuracy for both classification and generation tasks, at the cost of introducing much larger and slower models. Pruning methods have proven effective at reducing model size, while distillation methods have proven effective at speeding up inference. We introduce a block pruning approach targeting both small and fast models. Our approach extends structured methods by considering blocks of any size and integrates this structure into the movement pruning…
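The block-pruning idea above can be sketched as scoring fixed-size blocks of a weight matrix and zeroing the lowest-scoring ones. The paper learns block masks via movement pruning during fine-tuning; block magnitude is used here as a simpler stand-in, and all names are illustrative.

```python
import numpy as np

def block_prune(w, block=(2, 2), sparsity=0.5):
    """Zero out the lowest-norm blocks of a 2-D weight matrix.

    Sketch only: the paper learns block masks; block Frobenius norm
    is a magnitude-based stand-in for a learned score.
    """
    bh, bw = block
    h, w_ = w.shape
    assert h % bh == 0 and w_ % bw == 0, "matrix must tile into blocks"
    # Reshape so each (bh, bw) block is indexed by (row-block, col-block).
    blocks = w.reshape(h // bh, bh, w_ // bw, bw)
    norms = np.linalg.norm(blocks, axis=(1, 3))  # one score per block
    k = int(sparsity * norms.size)
    thr = np.partition(norms.ravel(), k)[k]
    mask = (norms >= thr)[:, None, :, None]      # broadcast over block dims
    return (blocks * mask).reshape(h, w_)
```

Because whole blocks are removed rather than scattered weights, the resulting sparsity pattern maps directly onto dense hardware kernels, which is what makes the pruned model faster rather than merely smaller.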

A Fast Post-Training Pruning Framework for Transformers

A fast post-training pruning framework for Transformers that prunes Transformers in less than 3 minutes on a single GPU, which is over two orders of magnitude faster than existing pruning approaches that retrain.

Gradient-based Intra-attention Pruning on Pre-trained Language Models

This work proposes GRAIN, which performs intra-attention pruning, allows different heads to have different sizes, and introduces a structure regularization that encourages more regular structures, which achieve higher speedups than heterogeneous ones.

Pruning Pretrained Encoders with a Multitask Objective

This work adopts recent strategies for model pruning during finetuning to explore the question of whether it is possible to prune a single encoder so that it can be used for multiple tasks.

PLATON: Pruning Large Transformer Models with Upper Confidence Bound of Weight Importance

PLATON is proposed, which captures the uncertainty of importance scores via an upper confidence bound (UCB) of importance estimation; for weights with low importance scores but high uncertainty, it tends to retain them and explores their capacity.

Prune Once for All: Sparse Pre-Trained Language Models

This work presents a new method for training sparse pre-trained Transformer language models by integrating weight pruning and model distillation, and shows the best compression-to-accuracy ratio for BERT-Base, BERT-Large, and DistilBERT.

LEAP: Learnable Pruning for Transformer-based Models

This work proposes LEArnable Pruning, an effective method that gradually prunes the model based on thresholds learned by gradient descent, and introduces a novel regularization function that directly interacts with the preset target pruning ratio.

SPDY: Accurate Pruning with Speedup Guarantees

SPDY, a new compression method, is introduced; it automatically determines layer-wise sparsity targets that achieve a desired inference speedup on a given system while minimizing accuracy loss.

GMP*: Well-Tuned Global Magnitude Pruning Can Outperform Most BERT-Pruning Methods

This work revisits the performance of the classic gradual magnitude pruning baseline for large language models and shows that a simple and general variant, called GMP*, can match and sometimes outperform more complex state-of-the-art methods.
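Gradual magnitude pruning is simple enough to sketch in full: repeatedly zero the globally smallest-magnitude weights, ramping the target sparsity along a schedule (the cubic schedule shown is a common choice). Function names and the schedule are illustrative, not taken from the paper's code.

```python
import numpy as np

def cubic_sparsity(step, total_steps, final_sparsity):
    """Cubic sparsity ramp commonly used with gradual pruning:
    prune slowly at first, then faster as training proceeds."""
    t = min(step / total_steps, 1.0)
    return final_sparsity * (1.0 - (1.0 - t) ** 3)

def prune_global_magnitude(weights, sparsity):
    """Zero out the globally smallest-magnitude weights across all
    matrices in `weights` so the given fraction becomes zero."""
    flat = np.concatenate([np.abs(w).ravel() for w in weights])
    k = int(sparsity * flat.size)
    if k == 0:
        return [w.copy() for w in weights]
    thr = np.partition(flat, k)[k]   # global magnitude threshold
    return [np.where(np.abs(w) >= thr, w, 0.0) for w in weights]
```

The "global" part matters: a single threshold across all layers lets the method allocate more sparsity to layers with many small weights instead of forcing a uniform per-layer ratio.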

EfficientVLM: Fast and Accurate Vision-Language Models via Knowledge Distillation and Modal-adaptive Pruning

A distilling-then-pruning framework is proposed to compress large vision-language models into smaller, faster, and more accurate ones, together with a modal-adaptive pruning algorithm that automatically infers the importance of the vision and language modalities for different downstream tasks and adaptively removes redundant structures and neurons in the different encoders with controllable target sparsity.

Exploring Extreme Parameter Compression for Pre-trained Language Models

This work aims to explore larger compression ratios for PLMs, among which tensor decomposition is a potential but under-investigated one, and shows that the proposed method is orthogonal to existing compression methods like knowledge distillation.

Poor Man's BERT: Smaller and Faster Transformer Models

A number of memory-light model reduction strategies that do not require model pre-training from scratch are explored, which are able to prune BERT, RoBERTa and XLNet models by up to 40%, while maintaining up to 98% of their original performance.

Train Large, Then Compress: Rethinking Model Size for Efficient Training and Inference of Transformers

It is shown that large models are more robust to compression techniques such as quantization and pruning than small models, and one can get the best of both worlds: heavily compressed, large models achieve higher accuracy than lightly compressed, small models.

Movement Pruning: Adaptive Sparsity by Fine-Tuning

Experiments show that when pruning large pretrained language models, movement pruning shows significant improvements in high-sparsity regimes and when combined with distillation, the approach achieves minimal accuracy loss with down to only 3% of the model parameters.
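The core of movement pruning can be sketched in a few lines: each weight accumulates a score of −w·∂L/∂w during fine-tuning, so weights moving toward zero score low and are pruned, regardless of their current magnitude. The class and method names below are illustrative, not the paper's code.

```python
import numpy as np

class MovementPruner:
    """Minimal sketch of movement pruning: scores accumulate
    -weight * gradient over fine-tuning steps; the lowest-scoring
    weights (those moving toward zero) are masked out."""

    def __init__(self, shape):
        self.scores = np.zeros(shape)

    def accumulate(self, weight, grad, lr=1.0):
        # A weight moving away from zero has sign(w) != sign(grad)
        # under gradient descent, so -w * grad is positive there.
        self.scores += -lr * weight * grad

    def mask(self, sparsity):
        """Return a keep-mask removing the given fraction of weights."""
        k = int(sparsity * self.scores.size)
        thr = np.partition(self.scores.ravel(), k)[k]
        return self.scores >= thr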

Compressing BERT: Studying the Effects of Weight Pruning on Transfer Learning

It is concluded that BERT can be pruned once during pre-training rather than separately for each task without affecting performance, and that fine-tuning BERT on a specific task does not improve its prunability.

Structured Pruning of a BERT-based Question Answering Model

This paper investigates compressing BERT- and RoBERTa-based question answering systems by structured pruning of parameters from the underlying trained transformer model, and finds that an inexpensive combination of task-specific structured pruning and task-specific distillation yields highly performing models across a range of speed/accuracy tradeoff operating points.

Reducing Transformer Depth on Demand with Structured Dropout

LayerDrop, a form of structured dropout, is explored, which has a regularization effect during training and allows for efficient pruning at inference time, and shows that it is possible to select sub-networks of any depth from one large network without having to finetune them and with limited impact on performance.
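LayerDrop's forward pass reduces to a one-line change: skip each residual layer with some probability during training, so that at inference any sub-list of layers still forms a working network. A minimal sketch, with illustrative names:

```python
import random

def layerdrop_forward(x, layers, drop_prob=0.0, rng=random):
    """Sketch of LayerDrop: during training each layer is skipped
    with probability drop_prob (the residual path carries x through);
    pruning at inference is just calling this with a sub-list of
    layers and drop_prob=0."""
    for layer in layers:
        if drop_prob > 0.0 and rng.random() < drop_prob:
            continue  # drop this layer for this forward pass
        x = layer(x)
    return x
```

Because training constantly exercises random sub-networks, keeping, say, every other layer at test time needs no fine-tuning, which is the property the summary above highlights.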

DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter

This work proposes a method to pre-train a smaller general-purpose language representation model, called DistilBERT, which can be fine-tuned with good performance on a wide range of tasks like its larger counterparts, and introduces a triple loss combining language modeling, distillation, and cosine-distance losses.
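The triple loss can be sketched as a weighted sum of a masked-LM cross-entropy, a soft-target distillation term (KL divergence at temperature T), and a cosine loss aligning student and teacher hidden states. The equal weights and function names below are illustrative assumptions, not the paper's exact settings.

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def triple_loss(student_logits, teacher_logits, labels,
                student_hidden, teacher_hidden,
                T=2.0, alpha=(1 / 3, 1 / 3, 1 / 3)):
    """Sketch of a DistilBERT-style objective: distillation KL at
    temperature T + masked-LM cross-entropy + cosine alignment of
    hidden states. The alpha weights are placeholders."""
    p_s = softmax(student_logits, T)
    p_t = softmax(teacher_logits, T)
    l_kl = np.sum(p_t * (np.log(p_t) - np.log(p_s)), axis=-1).mean()
    l_mlm = -np.log(
        softmax(student_logits)[np.arange(len(labels)), labels]).mean()
    cos = (student_hidden * teacher_hidden).sum(-1) / (
        np.linalg.norm(student_hidden, axis=-1)
        * np.linalg.norm(teacher_hidden, axis=-1))
    l_cos = (1.0 - cos).mean()
    a1, a2, a3 = alpha
    return a1 * l_kl + a2 * l_mlm + a3 * l_cos
```

When the student matches the teacher exactly, the KL and cosine terms vanish and only the masked-LM term remains, which is a useful sanity check for an implementation.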

Are Sixteen Heads Really Better than One?

The surprising observation is made that even when models have been trained using multiple heads, in practice a large percentage of attention heads can be removed at test time without significantly impacting performance.
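Removing a head at test time amounts to zeroing its contribution before the per-head outputs are combined. The sketch below scores heads by the norm of their output, a deliberate simplification of the paper's gradient-based importance score; names and shapes are illustrative.

```python
import numpy as np

def head_importance(head_outputs):
    """Proxy importance score: Frobenius norm of each head's output.
    (The paper uses a gradient-based score; this is a stand-in.)"""
    # head_outputs: (num_heads, seq_len, head_dim)
    return np.linalg.norm(head_outputs, axis=(1, 2))

def mask_lowest_heads(head_outputs, num_to_prune):
    """Zero the num_to_prune least-important heads, mimicking
    test-time head removal."""
    imp = head_importance(head_outputs)
    mask = np.ones(len(imp))
    mask[np.argsort(imp)[:num_to_prune]] = 0.0
    return head_outputs * mask[:, None, None]
```

Since multi-head attention sums the projected per-head outputs, a zeroed head contributes nothing, so the masked forward pass is numerically identical to physically pruning that head's parameters.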

FastFormers: Highly Efficient Transformer Models for Natural Language Understanding

This paper presents FastFormers, a set of recipes to achieve efficient inference-time performance for Transformer-based models on various NLU tasks and pretrained models, and shows how carefully utilizing knowledge distillation, structured pruning and numerical optimization can lead to drastic improvements on inference efficiency.

Compression of Neural Machine Translation Models via Pruning

It is shown that an NMT model with over 200 million parameters can be pruned by 40% with very little performance loss as measured on the WMT'14 English-German translation task.