The Right Tool for the Job: Matching Model and Instance Complexities

@inproceedings{Schwartz2020TheRT,
  title={The Right Tool for the Job: Matching Model and Instance Complexities},
  author={Roy Schwartz and Gabriel Stanovsky and Swabha Swayamdipta and Jesse Dodge and Noah A. Smith},
  booktitle={ACL},
  year={2020}
}
As NLP models become larger, executing a trained model requires significant computational resources, incurring monetary and environmental costs. To better respect a given inference budget, we propose a modification to contextual representation fine-tuning which, during inference, allows for an early (and fast) “exit” from neural network calculations for simple instances, and late (and accurate) exit for hard instances. To achieve this, we add classifiers to different layers of BERT and use their…
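To make the early-exit idea concrete, the sketch below attaches a small classifier to each BERT layer and returns the first prediction whose softmax confidence clears a threshold, falling through to the last layer otherwise. This is a minimal illustration under assumptions, not the authors' released code: it uses Hugging Face transformers-style classes (BertModel, BertTokenizer), and EarlyExitBert, the per-layer nn.Linear classifiers, and exit_threshold are illustrative names; the confidence calibration used in the paper is omitted.

import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

class EarlyExitBert(nn.Module):
    def __init__(self, model_name="bert-base-uncased", num_labels=2, exit_threshold=0.9):
        super().__init__()
        self.bert = BertModel.from_pretrained(model_name, output_hidden_states=True)
        hidden = self.bert.config.hidden_size
        n_layers = self.bert.config.num_hidden_layers
        # One lightweight classifier per transformer layer (all trained jointly during fine-tuning).
        self.classifiers = nn.ModuleList([nn.Linear(hidden, num_labels) for _ in range(n_layers)])
        self.exit_threshold = exit_threshold

    @torch.no_grad()
    def predict(self, input_ids, attention_mask):
        # Assumes a single instance (batch size 1), the typical early-exit setting.
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        hidden_states = outputs.hidden_states[1:]  # index 0 is the embedding layer
        for layer_idx, hidden_state in enumerate(hidden_states):
            logits = self.classifiers[layer_idx](hidden_state[:, 0])  # [CLS] representation
            probs = torch.softmax(logits, dim=-1)
            confidence, prediction = probs.max(dim=-1)
            if confidence.item() >= self.exit_threshold:  # "easy" instance: exit early
                return prediction, layer_idx + 1
        return prediction, len(hidden_states)             # "hard" instance: used all layers

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = EarlyExitBert().eval()
batch = tokenizer("a very simple sentence", return_tensors="pt")
label, exit_layer = model.predict(batch["input_ids"], batch["attention_mask"])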

Towards Efficient NLP: A Standard Evaluation and A Strong Baseline
TLDR
The proposed ELUE benchmark has a strong Pareto frontier and enables better evaluation of efficient NLP models, and a strong baseline, ElasticBERT, which allows BERT to exit at any layer in both static and dynamic ways, is released.
Elbert: Fast Albert with Confidence-Window Based Early Exit
TLDR
The ELBERT is proposed, which significantly improves the average inference speed compared to ALBERT due to the proposed confidence-window based early exit mechanism, without introducing additional parameters or extra training overhead.
Early Exiting BERT for Efficient Document Ranking
TLDR
Early-exiting BERT is introduced for document ranking: with a slight modification, BERT becomes a model with multiple output paths, each inference sample can exit early from one of these paths, and computation can be effectively allocated among samples.
CascadeBERT: Accelerating Inference of Pre-trained Language Models via Calibrated Complete Models Cascade
TLDR
CascadeBERT is proposed, which dynamically selects proper-sized and complete models in a cascading manner, providing comprehensive representations for predictions, and can achieve an overall 15% improvement under 4× speed-up compared with existing dynamic early exiting methods.
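The cascade described in the CascadeBERT entry above, answering with a small but complete model and escalating only low-confidence inputs to a larger one, can be pictured with a short sketch. This is illustrative, not CascadeBERT's implementation: cascade_predict, small_model, large_model, and the threshold tau are assumed names, and the difficulty-aware calibration the paper proposes is omitted.

import torch
import torch.nn as nn

def cascade_predict(small_model, large_model, inputs, tau=0.9):
    # Stage 1: the small but complete model answers first.
    probs = torch.softmax(small_model(inputs), dim=-1)
    confidence, prediction = probs.max(dim=-1)
    if confidence.item() >= tau:              # confident enough: stop here (cheap path)
        return prediction, "small"
    # Stage 2: only low-confidence inputs pay for the large model.
    probs = torch.softmax(large_model(inputs), dim=-1)
    return probs.argmax(dim=-1), "large"

# Toy usage with stand-in models; in practice these would be, e.g.,
# a 2-layer and a 12-layer BERT classifier.
small_model = nn.Linear(16, 3)
large_model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 3))
pred, which = cascade_predict(small_model, large_model, torch.randn(1, 16))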
AdapLeR: Speeding up Inference by Adaptive Length Reduction
TLDR
This work proposes a novel approach for reducing the computational cost of BERT with minimal loss in downstream performance, and dynamically eliminates less contributing tokens through layers, resulting in shorter lengths and consequently lower computational cost.
When in Doubt, Summon the Titans: Efficient Inference with Large Models
TLDR
The proposed use of distillation to only handle easy instances allows for a more aggressive trade-off in the student size, thereby reducing the amortized cost of inference and achieving better accuracy than standard distillation.
LeeBERT: Learned Early Exit for BERT with cross-level optimization
TLDR
A novel training scheme called Learned Early Exit for BERT (LeeBERT) is proposed, in which each exit learns from the others rather than only from the last layer; the optimization of LeeBERT is formulated as a bi-level optimization problem, and a novel cross-level optimization (CLO) algorithm is proposed to improve the optimization results.
Confident Adaptive Language Modeling
TLDR
This work introduces Confident Adaptive Language Modeling (CALM), a framework for dynamically allocating different amounts of compute per input and generation timestep, and demonstrates the efficacy of the framework in reducing compute while provably maintaining high performance.
Accelerating Pre-trained Language Models via Calibrated Cascade
TLDR
The working mechanism of dynamic early exiting is analyzed and it is found it cannot achieve a satisfying trade-off between inference speed and performance, so CascadeBERT is proposed, which dynamically selects a proper-sized, complete model in a cascading manner.
Accelerating BERT Inference for Sequence Labeling via Early-Exit
TLDR
The token-level early-exit mechanism introduces a gap between training and inference, so an extra self-sampling fine-tuning stage is added to alleviate it; the approach can save up to 66%–75% of inference cost with minimal performance degradation.
...

References

SHOWING 1-10 OF 55 REFERENCES
Efficient Contextualized Representation: Language Model Pruning for Sequence Labeling
TLDR
This work develops a layer selection method for model pruning using sparsity-inducing regularization that can detach any layer without affecting others, and stretch shallow and wide LMs to be deep and narrow.
DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter
TLDR
This work proposes a method to pre-train a smaller general-purpose language representation model, called DistilBERT, which can be fine-tuned with good performance on a wide range of tasks like its larger counterparts, and introduces a triple loss combining language modeling, distillation and cosine-distance losses.
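The distillation objective referenced in the DistilBERT summary, training a small student to match a large teacher's softened output distribution alongside the usual supervised loss, can be written as a short sketch. This is a generic knowledge-distillation loss, not DistilBERT's exact triple loss (its language-modeling and cosine-distance terms are omitted); distillation_loss, temperature, and alpha are illustrative names.

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, temperature=2.0, alpha=0.5):
    # Soft targets: KL divergence between temperature-softened distributions.
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    # Hard targets: ordinary cross-entropy on the gold labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Toy usage with random logits for a 3-class task.
s = torch.randn(8, 3, requires_grad=True)
t = torch.randn(8, 3)
y = torch.randint(0, 3, (8,))
loss = distillation_loss(s, t, y)
loss.backward()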
Show Your Work: Improved Reporting of Experimental Results
TLDR
It is demonstrated that test-set performance scores alone are insufficient for drawing accurate conclusions about which model performs best, and a novel technique is presented: expected validation performance of the best-found model as a function of computation budget.
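The "expected validation performance as a function of computation budget" statistic mentioned above can be estimated from the empirical distribution of validation scores collected over random hyperparameter trials: the expected maximum of k i.i.d. draws is obtained from the k-th power of the empirical CDF. The sketch below is one way to compute this under that reading; expected_max_validation, scores, and k are illustrative names, not the paper's code.

import numpy as np

def expected_max_validation(scores, k):
    # Expected best validation score when k hyperparameter assignments are sampled
    # i.i.d. from the empirical distribution of the observed scores.
    v = np.sort(np.asarray(scores, dtype=float))
    n = len(v)
    cdf_k = (np.arange(1, n + 1) / n) ** k          # P(max of k draws <= v[i])
    pmf = np.diff(np.concatenate(([0.0], cdf_k)))   # probability that v[i] is the best of k
    return float((v * pmf).sum())

# Toy usage: validation accuracies from 5 random trials, evaluated at budgets k = 1..5.
scores = [0.71, 0.74, 0.69, 0.77, 0.73]
curve = [expected_max_validation(scores, k) for k in range(1, 6)]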
TinyBERT: Distilling BERT for Natural Language Understanding
TLDR
A novel Transformer distillation method specially designed for knowledge distillation (KD) of Transformer-based models is proposed; by leveraging this new KD method, the rich knowledge encoded in a large “teacher” BERT can be effectively transferred to a small “student” TinyBERT.
Distilling Task-Specific Knowledge from BERT into Simple Neural Networks
TLDR
This paper proposes to distill knowledge from BERT, a state-of-the-art language representation model, into a single-layer BiLSTM, as well as its siamese counterpart for sentence-pair tasks, and achieves comparable results with ELMo.
RNN Architecture Learning with Sparse Regularization
TLDR
This work applies group lasso to rational RNNs (Peng et al., 2018), a family of models that is closely connected to weighted finite-state automata (WFSAs) and shows that sparsifying such models makes them easier to visualize, and presents models that rely exclusively on as few as three WFSAs after pruning more than 90% of the weights.
Controlling Computation versus Quality for Neural Sequence Models
TLDR
The proposed Conditional Computation Transformer (CCT) is competitive with vanilla Transformers when allowed to utilize its full computational budget, while improving significantly over computationally equivalent baselines when operating on smaller computational budgets.
Are Sixteen Heads Really Better than One?
TLDR
The surprising observation is made that even when models have been trained with multiple heads, in practice a large percentage of attention heads can be removed at test time without significantly impacting performance.
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
TLDR
This systematic study compares pre-training objectives, architectures, unlabeled datasets, transfer approaches, and other factors on dozens of language understanding tasks and achieves state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more.
Reducing Transformer Depth on Demand with Structured Dropout
TLDR
LayerDrop, a form of structured dropout, is explored; it has a regularization effect during training and allows for efficient pruning at inference time, and it is shown that sub-networks of any depth can be selected from one large network without having to fine-tune them and with limited impact on performance.
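The LayerDrop mechanism summarized above, randomly skipping whole transformer layers during training so that layers can be pruned at inference without retraining, is simple enough to sketch. This is an illustrative sketch, not the fairseq implementation; LayerDropEncoder and p_drop are assumed names, and plain linear layers stand in for transformer blocks.

import torch
import torch.nn as nn

class LayerDropEncoder(nn.Module):
    def __init__(self, layers, p_drop=0.2):
        super().__init__()
        self.layers = nn.ModuleList(layers)
        self.p_drop = p_drop

    def forward(self, x):
        for layer in self.layers:
            # Structured dropout: skip the entire layer with probability p_drop during training.
            if self.training and torch.rand(1).item() < self.p_drop:
                continue
            x = layer(x)
        return x

# Toy usage: six stand-in "layers"; at inference time any subset of them could be kept.
encoder = LayerDropEncoder([nn.Linear(32, 32) for _ in range(6)], p_drop=0.2)
out = encoder(torch.randn(4, 32))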
...