Corpus ID: 240354698

Magic Pyramid: Accelerating Inference with Early Exiting and Token Pruning

Xuanli He, Iman Keivanloo, Yi Xu, Xiang He, Belinda Zeng, Santosh Rajagopalan, Trishul M. Chilimbi
Pre-training and then fine-tuning large language models is commonly used to achieve state-of-the-art performance in natural language processing (NLP) tasks. However, most pre-trained models suffer from low inference speed. Deploying such large models to applications with latency constraints is challenging. In this work, we focus on accelerating the inference via conditional computations. To achieve this, we propose a novel idea, Magic Pyramid (MP), to reduce both width-wise and depth-wise… 


Transkimmer: Transformer Learns to Layer-wise Skim
The Transkimmer architecture is proposed, which learns to identify hidden state tokens that are not required by each layer, using a predictor before each layer that learns to make the skimming decision, and achieves a 10.97x average speedup on the GLUE benchmark compared with the vanilla BERT-base baseline with less than 1% accuracy degradation.
Attribution-based Task-specific Pruning for Multi-task Language Models
Experimental results on six widely used datasets show that the proposed pruning method significantly outperforms baseline compression methods and can be extended to a low-resource setting, where the number of labeled datasets is insufficient.


TinyBERT: Distilling BERT for Natural Language Understanding
A novel Transformer distillation method, specially designed for knowledge distillation (KD) of Transformer-based models, is proposed; by leveraging this new KD method, the wealth of knowledge encoded in a large “teacher” BERT can be effectively transferred to a small “student” TinyBERT.
SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning
SpAtten is presented, an efficient algorithm-architecture co-design that leverages token sparsity, head sparsity, and quantization opportunities to reduce attention computation and memory access, and that proposes novel cascade token pruning to prune away unimportant tokens in the sentence.
Reducing Transformer Depth on Demand with Structured Dropout
LayerDrop, a form of structured dropout, is explored, which has a regularization effect during training and allows for efficient pruning at inference time, and shows that it is possible to select sub-networks of any depth from one large network without having to finetune them and with limited impact on performance.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter
This work proposes a method to pre-train a smaller general-purpose language representation model, called DistilBERT, which can be fine-tuned with good performances on a wide range of tasks like its larger counterparts, and introduces a triple loss combining language modeling, distillation and cosine-distance losses.
The Right Tool for the Job: Matching Model and Instance Complexities
This work proposes a modification to contextual representation fine-tuning which allows for an early (and fast) “exit” from neural network calculations for simple instances, and late (and accurate) exit for hard instances during inference.
PoWER-BERT: Accelerating BERT Inference via Progressive Word-vector Elimination
This work develops a novel method, called PoWER-BERT, for improving the inference time of the popular BERT model, while maintaining the accuracy, and shows that it offers significantly better trade-off between accuracy and inference time compared to prior methods.
Learned Token Pruning for Transformers
A novel token reduction method dubbed Learned Token Pruning (LTP) is presented, which adaptively removes unimportant tokens as an input sequence passes through transformer layers and is more robust than prior methods to variations in input sequence lengths.
DynaBERT: Dynamic BERT with Adaptive Width and Depth
A novel dynamic BERT model (abbreviated as DynaBERT), which can run at adaptive width and depth, is proposed; it has performance comparable to BERT (or RoBERTa), while at smaller widths and depths it consistently outperforms existing BERT compression methods.
Patient Knowledge Distillation for BERT Model Compression
This work proposes a Patient Knowledge Distillation approach to compress an original large model (teacher) into an equally-effective lightweight shallow network (student), which translates into improved results on multiple NLP tasks with a significant gain in training efficiency, without sacrificing model accuracy.