The Right Tool for the Job: Matching Model and Instance Complexities
@inproceedings{Schwartz2020TheRT,
  title     = {The Right Tool for the Job: Matching Model and Instance Complexities},
  author    = {Roy Schwartz and Gabriel Stanovsky and Swabha Swayamdipta and Jesse Dodge and Noah A. Smith},
  booktitle = {ACL},
  year      = {2020}
}
As NLP models become larger, executing a trained model requires significant computational resources, incurring monetary and environmental costs. To better respect a given inference budget, we propose a modification to contextual representation fine-tuning which, during inference, allows for an early (and fast) “exit” from neural network calculations for simple instances and a late (and accurate) exit for hard instances. To achieve this, we add classifiers to different layers of BERT and use their…
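As a rough illustration of the mechanism the abstract describes, the sketch below shows what inference-time early exit with per-layer classifiers and calibrated confidence might look like. It is a minimal sketch, assuming a HuggingFace-style BertModel and batch size 1; the EarlyExitBert class, the confidence threshold, and the calibration temperature are illustrative assumptions, not the authors' released code, and training of the per-layer classifiers is not shown.

import torch
import torch.nn as nn


class EarlyExitBert(nn.Module):
    def __init__(self, bert, num_labels, threshold=0.9, temperature=1.0):
        super().__init__()
        self.bert = bert                    # a pretrained BertModel (assumption: HuggingFace-style)
        self.threshold = threshold          # exit once the max softmax probability passes this value
        self.temperature = temperature      # calibration temperature, tuned on development data
        hidden = bert.config.hidden_size
        # one lightweight classifier ("exit") attached to every transformer layer
        self.exits = nn.ModuleList(
            [nn.Linear(hidden, num_labels) for _ in range(bert.config.num_hidden_layers)]
        )

    @torch.no_grad()
    def forward(self, input_ids, attention_mask):
        # standard additive attention mask expected by the BERT layers
        ext_mask = (1.0 - attention_mask[:, None, None, :].float()) * torch.finfo(torch.float32).min
        hidden = self.bert.embeddings(input_ids)
        logits = None
        for layer, exit_head in zip(self.bert.encoder.layer, self.exits):
            hidden = layer(hidden, attention_mask=ext_mask)[0]
            logits = exit_head(hidden[:, 0])                      # classify from the [CLS] token
            probs = torch.softmax(logits / self.temperature, dim=-1)
            if probs.max().item() >= self.threshold:              # confident enough: exit early
                return logits                                     # (exit decision assumes batch size 1)
        return logits                                             # otherwise fall through to the last layer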
62 Citations
Towards Efficient NLP: A Standard Evaluation and A Strong Baseline
- Computer Science, NAACL
- 2022
The proposed ELUE benchmark has a strong Pareto frontier and provides a better evaluation for efficient NLP models; a strong baseline, ElasticBERT, which allows BERT to exit at any layer in both static and dynamic ways, is also released.
Elbert: Fast Albert with Confidence-Window Based Early Exit
- Computer Science, ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2021
ELBERT is proposed, which significantly improves average inference speed over ALBERT thanks to a confidence-window-based early exit mechanism, without introducing additional parameters or extra training overhead.
Early Exiting BERT for Efficient Document Ranking
- Computer Science, SUSTAINLP
- 2020
Early-exiting BERT is introduced for document ranking: with a slight modification, BERT becomes a model with multiple output paths from which each inference sample can exit early, so computation can be effectively allocated among samples.
CascadeBERT: Accelerating Inference of Pre-trained Language Models via Calibrated Complete Models Cascade
- Computer Science, EMNLP
- 2021
CascadeBERT is proposed, which dynamically selects properly sized, complete models in a cascading manner, providing comprehensive representations for prediction, and achieves an overall 15% improvement under a 4× speed-up compared with existing dynamic early-exit methods.
AdapLeR: Speeding up Inference by Adaptive Length Reduction
- Computer Science, ACL
- 2022
This work proposes a novel approach for reducing the computational cost of BERT with minimal loss in downstream performance: it dynamically eliminates less-contributing tokens through the layers, resulting in shorter sequence lengths and consequently lower computational cost.
When in Doubt, Summon the Titans: Efficient Inference with Large Models
- Computer Science, arXiv
- 2021
The proposed use of distillation to handle only easy instances allows for a more aggressive trade-off in the student size, thereby reducing the amortized cost of inference and achieving better accuracy than standard distillation.
LeeBERT: Learned Early Exit for BERT with cross-level optimization
- Computer Science, ACL
- 2021
A novel training scheme called Learned Early Exit for BERT (LeeBERT) is proposed, which asks the exits to learn from one another rather than only from the last layer, formulates the optimization of LeeBERT as a bi-level optimization problem, and introduces a novel cross-level optimization (CLO) algorithm to improve the optimization results.
Confident Adaptive Language Modeling
- Computer Science, arXiv
- 2022
This work introduces Confident Adaptive Language Modeling (CALM), a framework for dynamically allocating different amounts of compute per input and generation timestep, and demonstrates the efficacy of the framework in reducing compute while provably maintaining high performance.
Accelerating Pre-trained Language Models via Calibrated Cascade
- Computer Science, arXiv
- 2020
The working mechanism of dynamic early exiting is analyzed and found unable to achieve a satisfying trade-off between inference speed and performance, so CascadeBERT is proposed, which dynamically selects a properly sized, complete model in a cascading manner.
Accelerating BERT Inference for Sequence Labeling via Early-Exit
- Computer Science, ACL
- 2021
The token-level early-exit mechanism introduces a gap between training and inference, so an extra self-sampling fine-tuning stage is added to alleviate it; the approach can save 66%–75% of the inference cost with minimal performance degradation.
References
Showing 1–10 of 55 references
Efficient Contextualized Representation: Language Model Pruning for Sequence Labeling
- Computer Science, EMNLP
- 2018
This work develops a layer selection method for model pruning using sparsity-inducing regularization that can detach any layer without affecting others, and stretch shallow and wide LMs to be deep and narrow.
DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter
- Computer Science, arXiv
- 2019
This work proposes a method to pre-train a smaller general-purpose language representation model, called DistilBERT, which can be fine-tuned with good performance on a wide range of tasks like its larger counterparts, and introduces a triple loss combining language modeling, distillation, and cosine-distance losses.
Show Your Work: Improved Reporting of Experimental Results
- Computer Science, EMNLP
- 2019
It is demonstrated that test-set performance scores alone are insufficient for drawing accurate conclusions about which model performs best, and a novel technique is presented: expected validation performance of the best-found model as a function of computation budget.
TinyBERT: Distilling BERT for Natural Language Understanding
- Computer Science, Findings of EMNLP
- 2020
A novel Transformer distillation method specially designed for knowledge distillation (KD) of Transformer-based models is proposed; by leveraging this new KD method, the rich knowledge encoded in a large “teacher” BERT can be effectively transferred to a small “student” TinyBERT.
Distilling Task-Specific Knowledge from BERT into Simple Neural Networks
- Computer Science, arXiv
- 2019
This paper proposes to distill knowledge from BERT, a state-of-the-art language representation model, into a single-layer BiLSTM, as well as its siamese counterpart for sentence-pair tasks, and achieves comparable results with ELMo.
RNN Architecture Learning with Sparse Regularization
- Computer Science, EMNLP
- 2019
This work applies group lasso to rational RNNs (Peng et al., 2018), a family of models that is closely connected to weighted finite-state automata (WFSAs) and shows that sparsifying such models makes them easier to visualize, and presents models that rely exclusively on as few as three WFSAs after pruning more than 90% of the weights.
Controlling Computation versus Quality for Neural Sequence Models
- Computer Science, arXiv
- 2020
The proposed Conditional Computation Transformer (CCT) is competitive with vanilla Transformers when allowed to utilize its full computational budget, while improving significantly over computationally equivalent baselines when operating on smaller computational budgets.
Are Sixteen Heads Really Better than One?
- Computer Science, NeurIPS
- 2019
The surprising observation is made that even if models have been trained using multiple heads, in practice a large percentage of attention heads can be removed at test time without significantly impacting performance.
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
- Computer Science, J. Mach. Learn. Res.
- 2020
This systematic study compares pre-training objectives, architectures, unlabeled datasets, transfer approaches, and other factors on dozens of language understanding tasks and achieves state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more.
Reducing Transformer Depth on Demand with Structured Dropout
- Computer Science, ICLR
- 2020
LayerDrop, a form of structured dropout, is explored; it has a regularization effect during training and allows for efficient pruning at inference time, and it is shown that sub-networks of any depth can be selected from one large network without finetuning and with limited impact on performance.