• Corpus ID: 243938339

Prune Once for All: Sparse Pre-Trained Language Models

Ofir Zafrir, Ariel Larey, Guy Boudoukh, Haihao Shen, Moshe Wasserblat
Transformer-based language models are applied to a wide range of applications in natural language processing. However, they are inefficient and difficult to deploy. In recent years, many compression algorithms have been proposed to increase the implementation efficiency of large Transformer-based models on target hardware. In this work we present a new method for training sparse pre-trained Transformer language models by integrating weight pruning and model distillation. These sparse pre-trained… 


Structured Pruning Learns Compact and Accurate Models
This work proposes a task-specific structured pruning method CoFi (Coarse- and Fine-grained Pruning), which delivers highly parallelizable subnetworks and matches the distillation methods in both accuracy and latency, without resorting to any unlabeled data.
Combining Improvements in the Compression of Large Language Models
This work develops theoretical intuition for the proposed combinations of compression methods, revealing a deeper connection through matrix rank, and an impact on generalization error.
Sparse*BERT: Sparse Models are Robust
The general sparse model Sparse*BERT is shown to become SparseBioBERT simply by pretraining the compressed architecture on unstructured biomedical text, and SparseBioBERT can match the quality of BioBERT with only 10% of the parameters.
PLATON: Pruning Large Transformer Models with Upper Confidence Bound of Weight Importance
PLATON is proposed, which captures the uncertainty of importance scores by upper confidence bound (UCB) of importance estimation and reduces the non-negligible variability due to training dynamics and mini-batch sampling.
LightHuBERT: Lightweight and Configurable Speech Representation Learning with Once-for-All Hidden-Unit BERT
This work proposes a Transformer-based supernet nested with thousands of weight-sharing subnets and designs a two-stage distillation strategy that leverages the contextualized latent representations from HuBERT, creating a once-for-all Transformer compression framework.
M6-Rec: Generative Pretrained Language Models are Open-Ended Recommender Systems
This paper builds the foundation of a unified foundation model to support open-ended domains and tasks in an industrial recommender system, which may reduce the demand on downstream settings’ data and can minimize the carbon footprint by avoiding training a separate model from scratch for every task.
A Novel Filter Pruning Algorithm for Vision Tasks based on Kernel Grouping
This research revisits a model compression algorithm named Model Diet that can be applied to both involution and convolution models and presents its application to two different tasks, image segmentation and depth estimation.
The Optimal BERT Surgeon: Scalable and Accurate Second-Order Pruning for Large Language Models
Optimal BERT Surgeon (oBERT) is introduced, an efficient and accurate weight pruning method based on approximate second-order information, which is shown to yield state-of-the-art results in both stages of language tasks: pre-training and fine-tuning.


TinyBERT: Distilling BERT for Natural Language Understanding
A novel Transformer distillation method designed specifically for knowledge distillation (KD) of Transformer-based models is proposed; with this new KD method, the wealth of knowledge encoded in a large “teacher” BERT can be effectively transferred to a small “student” TinyBERT.
Q8BERT: Quantized 8Bit BERT
This work shows how to perform quantization-aware training during the fine-tuning phase of BERT in order to compress BERT by 4x with minimal accuracy loss; the resulting quantized model can accelerate inference if it is optimized for hardware that supports 8-bit integer operations.
DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter
This work proposes a method to pre-train a smaller general-purpose language representation model, called DistilBERT, which can be fine-tuned with good performance on a wide range of tasks like its larger counterparts, and introduces a triple loss combining language modeling, distillation and cosine-distance losses.
FastFormers: Highly Efficient Transformer Models for Natural Language Understanding
This paper presents FastFormers, a set of recipes to achieve efficient inference-time performance for Transformer-based models on various NLU tasks and pretrained models, and shows how carefully utilizing knowledge distillation, structured pruning and numerical optimization can lead to drastic improvements on inference efficiency.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
Compressing BERT: Studying the Effects of Weight Pruning on Transfer Learning
It is concluded that BERT can be pruned once during pre-training rather than separately for each task without affecting performance, and that fine-tuning BERT on a specific task does not improve its prunability.
Transformers: State-of-the-Art Natural Language Processing
Transformers is an open-source library that consists of carefully engineered state-of-the-art Transformer architectures under a unified API and a curated collection of pretrained models made by and available for the community.
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
This work simplifies the MoE routing algorithm and designs intuitive improved models with reduced communication and computational costs. It advances the current scale of language models by pre-training up to trillion-parameter models on the “Colossal Clean Crawled Corpus” and achieves a 4x speedup over the T5-XXL model.
To prune, or not to prune: exploring the efficacy of pruning for model compression
Across a broad range of neural network architectures, large-sparse models are found to consistently outperform small-dense models and achieve up to 10x reduction in number of non-zero parameters with minimal loss in accuracy.
Language Models are Few-Shot Learners
GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic.