Structured Pruning of Large Language Models

@inproceedings{Wang2020StructuredPO,
  title={Structured Pruning of Large Language Models},
  author={Ziheng Wang and Jeremy Wohlwend and Tao Lei},
  booktitle={EMNLP},
  year={2020}
}
Large language models have recently achieved state-of-the-art performance across a wide variety of natural language tasks. Meanwhile, the size and latency of these models have grown significantly, making them costly to use and raising an interesting question: do language models need to be large? We study this question through the lens of model compression. We present a novel, structured pruning approach based on low-rank factorization and augmented Lagrangian L0 norm…
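To make the pruning recipe concrete, here is a minimal PyTorch sketch of the general idea (not the authors' released implementation): each weight matrix is factorized as P diag(z) Q, the gates z are relaxed with a hard-concrete distribution so their expected L0 norm is differentiable, and an augmented-Lagrangian-style penalty pushes the expected number of kept rank components toward a target. Module names, hyperparameters, and the placeholder task loss are assumptions for the example.

import math
import torch
import torch.nn as nn
import torch.nn.functional as F


class FactorizedLinear(nn.Module):
    """Linear layer whose weight is P @ diag(z) @ Q with L0-regularized gates z.

    Zeroing a gate removes an entire rank-1 component, so the surviving
    factors stay dense (structured, hardware-friendly pruning).
    """

    def __init__(self, d_in, d_out, rank, beta=0.66, gamma=-0.1, zeta=1.1):
        super().__init__()
        self.P = nn.Parameter(torch.randn(d_out, rank) / math.sqrt(rank))
        self.Q = nn.Parameter(torch.randn(rank, d_in) / math.sqrt(d_in))
        self.log_alpha = nn.Parameter(torch.zeros(rank))   # gate logits
        self.beta, self.gamma, self.zeta = beta, gamma, zeta

    def gates(self):
        if self.training:   # stochastic hard-concrete reparameterization
            u = torch.rand_like(self.log_alpha).clamp(1e-6, 1 - 1e-6)
            s = torch.sigmoid((u.log() - (-u).log1p() + self.log_alpha) / self.beta)
        else:               # deterministic gates at inference
            s = torch.sigmoid(self.log_alpha)
        return (s * (self.zeta - self.gamma) + self.gamma).clamp(0.0, 1.0)

    def expected_size(self):
        # Expected number of non-zero gates (differentiable L0 surrogate).
        return torch.sigmoid(
            self.log_alpha - self.beta * math.log(-self.gamma / self.zeta)
        ).sum()

    def forward(self, x):
        z = self.gates()
        return F.linear(x, (self.P * z) @ self.Q)   # z broadcasts over columns of P


# Augmented-Lagrangian-style penalty: lam1/lam2 would be updated by gradient
# ascent so the expected kept size is driven toward the target.
layer = FactorizedLinear(d_in=768, d_out=768, rank=256)
lam1, lam2, target = 0.0, 0.0, 64.0
x = torch.randn(8, 768)
task_loss = layer(x).pow(2).mean()                  # placeholder for the real task loss
gap = layer.expected_size() - target
(task_loss + lam1 * gap + lam2 * gap ** 2).backward()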
Compressing Pre-trained Language Models by Matrix Decomposition
TLDR
A two-stage model-compression method to reduce a model’s inference time cost by first decomposing the matrices in the model into smaller matrices and then performing feature distillation on the internal representation to recover from the decomposition.
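As a rough illustration of the first stage described here (the feature-distillation stage is omitted), a dense linear layer can be replaced by a truncated-SVD pair of smaller layers; the rank and layer sizes below are assumptions for the sketch, not values from the paper.

import torch
import torch.nn as nn


def factorize_linear(layer: nn.Linear, rank: int) -> nn.Sequential:
    """Replace a dense nn.Linear with two smaller layers via truncated SVD."""
    W = layer.weight.data                          # shape (d_out, d_in)
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    U, S, Vh = U[:, :rank], S[:rank], Vh[:rank]

    first = nn.Linear(layer.in_features, rank, bias=False)
    second = nn.Linear(rank, layer.out_features, bias=layer.bias is not None)
    first.weight.data = torch.diag(S) @ Vh         # (rank, d_in)
    second.weight.data = U                         # (d_out, rank)
    if layer.bias is not None:
        second.bias.data = layer.bias.data.clone()
    return nn.Sequential(first, second)


dense = nn.Linear(768, 3072)                       # ~2.4M parameters
low_rank = factorize_linear(dense, rank=128)       # ~0.5M parameters at rank 128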
Know What You Don't Need: Single-Shot Meta-Pruning for Attention Heads
TLDR
This work proposes a method to compress deep pre-trained Transformers before fine-tuning, and trains the Single-Shot Meta-Pruner (SMP) with a meta-learning paradigm aiming to maintain the distribution of text representations after pruning.
ROSITA: Refined BERT cOmpreSsion with InTegrAted techniques
TLDR
The best compressed model, dubbed Refined BERT cOmpreSsion with InTegrAted techniques (ROSITA), is 7.5× smaller than BERT while maintaining 98.5% of the performance on five tasks of the GLUE benchmark, outperforming previous BERT compression methods with a similar parameter budget.
PoWER-BERT: Accelerating BERT inference for Classification Tasks
TLDR
This work considers classification tasks and proposes a novel method, called PoWER-BERT, for improving the inference time of the BERT model without significant loss in accuracy, and shows that, compared to prior inference-time reduction methods, it offers a better trade-off between accuracy and inference time.
SparseRT: Accelerating Unstructured Sparsity on GPUs for Deep Learning Inference
TLDR
SparseRT is presented, a code generator that leverages unstructured sparsity to accelerate sparse linear algebra operations in deep learning inference on GPUs, showing speedups of over 5x on use cases in ResNet-50.
Which *BERT? A Survey Organizing Contextualized Encoders
TLDR
A survey on language representation learning is presented with the aim of consolidating a series of shared lessons learned across a variety of recent efforts, and highlights important considerations when interpreting recent contributions and choosing which model to use.
Chinese Named Entity Recognition Method Based on ALBERT
The BERT pre-trained language model has been widely used in Chinese named entity recognition due to its good performance, but its large number of parameters and long training time have limited its application.
Pruning a BERT-based Question Answering Model
TLDR
This work starts from models trained for SQuAD 2.0 and introduces gates that allow selected parts of transformers to be individually eliminated and finds that a combination of pruning attention heads and the feed-forward layer almost doubles the decoding speed.
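A bare-bones sketch of the gating idea mentioned here: multiply each attention head's output by a gate so that heads whose gate is zero can be removed entirely. How the gates are learned (and the analogous gating of feed-forward layers) is omitted, and all names below are illustrative.

import torch
import torch.nn as nn


class GatedHeads(nn.Module):
    """Scale per-head attention outputs by a 0/1 gate; gated-off heads can be pruned."""

    def __init__(self, n_heads: int):
        super().__init__()
        # 1 = keep head, 0 = eliminate head; in practice these would be learned.
        self.register_buffer("gate", torch.ones(n_heads))

    def forward(self, head_outputs: torch.Tensor) -> torch.Tensor:
        # head_outputs: (batch, seq_len, n_heads, head_dim)
        return head_outputs * self.gate.view(1, 1, -1, 1)


gates = GatedHeads(n_heads=12)
gates.gate[3] = 0.0                                 # switch off head 3
out = gates(torch.randn(2, 128, 12, 64))            # head 3's output is now all zeros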
BERT-EMD: Many-to-Many Layer Mapping for BERT Compression with Earth Mover’s Distance
TLDR
A novel BERT distillation method based on many-to-many layer mapping, which allows each intermediate student layer to learn from any intermediate teacher layers adaptively for various NLP tasks, is proposed.
Deep Learning Meets Projective Clustering
TLDR
A novel architecture is provided that replaces the original embedding layer by a set of k small layers that operate in parallel and are then recombined with a single fully-connected layer.

References

Showing 1-10 of 63 references
Reducing Transformer Depth on Demand with Structured Dropout
TLDR
LayerDrop, a form of structured dropout, is explored; it has a regularization effect during training and allows for efficient pruning at inference time, and it is shown that sub-networks of any depth can be selected from one large network without having to finetune them and with limited impact on performance.
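A minimal sketch of the layer-level structured dropout described here; the drop probability, the block type, and the "keep every other layer" inference rule are assumptions for the example.

from typing import Optional

import torch
import torch.nn as nn


class LayerDropStack(nn.Module):
    """Apply a stack of residual blocks, randomly skipping whole blocks in training."""

    def __init__(self, layers: nn.ModuleList, p_drop: float = 0.2):
        super().__init__()
        self.layers, self.p_drop = layers, p_drop

    def forward(self, x: torch.Tensor, keep_every: Optional[int] = None) -> torch.Tensor:
        for i, layer in enumerate(self.layers):
            if self.training and torch.rand(()) < self.p_drop:
                continue                            # drop the whole layer this step
            if keep_every is not None and i % keep_every != 0:
                continue                            # prune layers at inference time
            x = x + layer(x)                        # residual connection
        return x


blocks = nn.ModuleList(nn.Linear(64, 64) for _ in range(12))
model = LayerDropStack(blocks, p_drop=0.2)
out = model(torch.randn(8, 64))                     # stochastic depth during training
model.eval()
out = model(torch.randn(8, 64), keep_every=2)       # 6-layer sub-network, no finetuning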
Efficient Contextualized Representation: Language Model Pruning for Sequence Labeling
TLDR
This work develops a layer selection method for model pruning using sparsity-inducing regularization that can detach any layer without affecting others, and stretch shallow and wide LMs to be deep and narrow.
ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
TLDR
This work presents two parameter-reduction techniques to lower memory consumption and increase the training speed of BERT, and uses a self-supervised loss that focuses on modeling inter-sentence coherence.
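The two parameter-reduction techniques in question are a factorized embedding parameterization and cross-layer parameter sharing; a toy sketch of both follows. The dimensions, the stock encoder block, and the omission of the inter-sentence coherence loss are simplifications for the example.

import torch
import torch.nn as nn


class TinyAlbertStyleEncoder(nn.Module):
    def __init__(self, vocab=30000, embed_dim=128, hidden=768, n_layers=12):
        super().__init__()
        # (1) Factorized embedding: vocab -> embed_dim -> hidden instead of vocab -> hidden.
        self.tok = nn.Embedding(vocab, embed_dim)
        self.proj = nn.Linear(embed_dim, hidden)
        # (2) Cross-layer sharing: a single block applied n_layers times.
        self.shared_block = nn.TransformerEncoderLayer(
            d_model=hidden, nhead=12, batch_first=True
        )
        self.n_layers = n_layers

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        x = self.proj(self.tok(token_ids))
        for _ in range(self.n_layers):
            x = self.shared_block(x)                # same weights reused at every depth
        return x


model = TinyAlbertStyleEncoder()
hidden_states = model(torch.randint(0, 30000, (2, 16)))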
DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter
TLDR
This work proposes a method to pre-train a smaller general-purpose language representation model, called DistilBERT, which can be fine-tuned with good performances on a wide range of tasks like its larger counterparts, and introduces a triple loss combining language modeling, distillation and cosine-distance losses.
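A sketch of the triple loss mentioned here: masked-LM cross-entropy, a temperature-scaled KL distillation term, and a cosine loss aligning student and teacher hidden states. The random tensors, the temperature, and the equal weighting below are placeholders, not values from the paper.

import torch
import torch.nn.functional as F

T = 2.0                                             # softmax temperature (assumed value)
student_logits = torch.randn(8, 30000, requires_grad=True)
teacher_logits = torch.randn(8, 30000)
labels = torch.randint(0, 30000, (8,))
student_h = torch.randn(8, 768, requires_grad=True)
teacher_h = torch.randn(8, 768)

loss_mlm = F.cross_entropy(student_logits, labels)  # language modeling term
loss_kd = F.kl_div(                                 # distillation term
    F.log_softmax(student_logits / T, dim=-1),
    F.softmax(teacher_logits / T, dim=-1),
    reduction="batchmean",
) * T * T
loss_cos = 1 - F.cosine_similarity(student_h, teacher_h, dim=-1).mean()

loss = loss_mlm + loss_kd + loss_cos                # relative weights are a design choice
loss.backward()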
Improving Language Understanding by Generative Pre-Training
TLDR
The general task-agnostic model outperforms discriminatively trained models that use architectures specifically crafted for each task, significantly improving upon the state of the art in 9 out of the 12 tasks studied.
On the State of the Art of Evaluation in Neural Language Models
TLDR
This work reevaluates several popular architectures and regularisation methods with large-scale automatic black-box hyperparameter tuning and arrives at the somewhat surprising conclusion that standard LSTM architectures, when properly regularised, outperform more recent models.
Compression of Neural Machine Translation Models via Pruning
TLDR
It is shown that an NMT model with over 200 million parameters can be pruned by 40% with very little performance loss as measured on the WMT'14 English-German translation task.
To prune, or not to prune: exploring the efficacy of pruning for model compression
TLDR
Across a broad range of neural network architectures, large-sparse models are found to consistently outperform small-dense models and achieve up to 10x reduction in number of non-zero parameters with minimal loss in accuracy.
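For contrast with the structured approaches above, here is a one-shot sketch of the unstructured magnitude pruning these experiments rely on; the paper's gradual sparsity schedule during training is not shown, and the function name and sparsity level are illustrative.

import torch
import torch.nn as nn


def magnitude_prune_(layer: nn.Linear, sparsity: float) -> torch.Tensor:
    """Zero the lowest-|w| fraction of weights in place; return the 0/1 mask."""
    w = layer.weight.data
    k = int(sparsity * w.numel())
    if k == 0:
        return torch.ones_like(w)
    threshold = w.abs().flatten().kthvalue(k).values
    mask = (w.abs() > threshold).float()
    w.mul_(mask)             # during training, the mask is reapplied after each update
    return mask


layer = nn.Linear(1024, 1024)
mask = magnitude_prune_(layer, sparsity=0.9)
print(f"non-zero fraction: {mask.mean().item():.2f}")   # roughly 0.10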
ERNIE: Enhanced Language Representation with Informative Entities
TLDR
This paper utilizes both large-scale textual corpora and KGs to train an enhanced language representation model (ERNIE) which can take full advantage of lexical, syntactic, and knowledge information simultaneously, and is comparable with the state-of-the-art model BERT on other common NLP tasks.
The State of Sparsity in Deep Neural Networks
TLDR
It is shown that unstructured sparse architectures learned through pruning cannot be trained from scratch to the same test set performance as a model trained with joint sparsification and optimization, and the need for large-scale benchmarks in the field of model compression is highlighted.