Accelerating BERT Inference for Sequence Labeling via Early-Exit

@article{Li2021AcceleratingBI,
  title={Accelerating BERT Inference for Sequence Labeling via Early-Exit},
  author={Xiaonan Li and Yunfan Shao and Tianxiang Sun and Hang Yan and Xipeng Qiu and Xuanjing Huang},
  journal={ArXiv},
  year={2021},
  volume={abs/2105.13878}
}
Both performance and efficiency are crucial factors for sequence labeling tasks in many real-world scenarios. Although pre-trained models (PTMs) have significantly improved the performance of various sequence labeling tasks, their computational cost is high. To alleviate this problem, we extend the recent successful early-exit mechanism to accelerate the inference of PTMs for sequence labeling tasks. However, existing early-exit mechanisms are specifically designed for sequence-level…
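
To make the token-level early-exit idea concrete, the sketch below shows one way it can work (a minimal illustration, not the authors' implementation; the `EarlyExitTagger` class, the layer sizes, and the 0.9 confidence threshold are assumptions): every transformer layer gets a lightweight token classifier, and a token's label is frozen as soon as its prediction confidence crosses the threshold, so deeper layers only matter for the remaining uncertain tokens.

```python
# Hypothetical sketch of token-level early exit for sequence labeling.
# Each transformer layer has its own token classifier ("exit head"); a token
# stops updating once its prediction confidence exceeds a threshold. For
# simplicity, frozen tokens' states are simply carried forward here rather
# than skipped in later layers.
import torch
import torch.nn as nn

class EarlyExitTagger(nn.Module):
    def __init__(self, hidden=256, num_layers=6, num_labels=9, threshold=0.9):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model=hidden, nhead=4, batch_first=True)
            for _ in range(num_layers)
        )
        self.exits = nn.ModuleList(nn.Linear(hidden, num_labels) for _ in range(num_layers))
        self.threshold = threshold

    @torch.no_grad()
    def infer(self, hidden_states):                       # (batch, seq_len, hidden)
        batch, seq_len, _ = hidden_states.shape
        done = torch.zeros(batch, seq_len, dtype=torch.bool, device=hidden_states.device)
        labels = torch.zeros(batch, seq_len, dtype=torch.long, device=hidden_states.device)
        for layer, exit_head in zip(self.layers, self.exits):
            hidden_states = layer(hidden_states)
            probs = exit_head(hidden_states).softmax(dim=-1)
            conf, pred = probs.max(dim=-1)
            newly_done = (conf >= self.threshold) & ~done
            labels[newly_done] = pred[newly_done]          # freeze confident tokens
            done |= newly_done
            if done.all():                                 # whole batch finished early
                break
        labels[~done] = pred[~done]                        # fall back to the last layer
        return labels

tagger = EarlyExitTagger()
print(tagger.infer(torch.randn(2, 12, 256)).shape)         # torch.Size([2, 12])
```

Lowering the threshold exits tokens earlier and trades accuracy for speed, which is the main knob such methods expose.
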

Citations

CascadeBERT: Accelerating Inference of Pre-trained Language Models via Calibrated Complete Models Cascade
  • Lei Li, Yankai Lin, +4 authors Xu Sun
  • Computer Science
  • 2020
TLDR: CascadeBERT is proposed, which dynamically selects proper-sized and complete models in a cascading manner, providing comprehensive representations for predictions, and can achieve an overall 15% improvement under 4× speed-up compared with existing dynamic early-exiting methods.
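
As a rough illustration of the cascade idea described above (a hypothetical sketch; the confidence threshold and the `cascade_predict` helper are assumptions, not CascadeBERT's actual calibration procedure): a small complete model answers first, and larger models are consulted only when it is not confident enough.

```python
# Confidence-based cascade over complete models: try the smallest model first
# and fall back to a larger one only when prediction confidence is too low.
import torch

def cascade_predict(models, inputs, threshold=0.9):
    """models: callables returning logits, ordered small -> large."""
    for i, model in enumerate(models):
        probs = model(inputs).softmax(dim=-1)
        confidence, prediction = probs.max(dim=-1)
        # The last model always answers; earlier models answer only when confident.
        if i == len(models) - 1 or confidence.min() >= threshold:
            return prediction

# Toy usage with stand-in "models":
small = lambda x: torch.randn(1, 3)
large = lambda x: torch.randn(1, 3)
print(cascade_predict([small, large], inputs=None))
```
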
Towards Efficient NLP: A Standard Evaluation and A Strong Baseline
  • Xiangyang Liu, Tianxiang Sun, +6 authors Xipeng Qiu
  • Computer Science
  • 2021
TLDR: ElasticBERT, despite its simplicity, outperforms or performs on par with SOTA compressed and early-exiting models; the work also introduces ELUE, a standard evaluation benchmark and public leaderboard for efficient NLP models.
A Survey of Transformers
TLDR: This survey provides a comprehensive review of various Transformer variants and proposes a new taxonomy of X-formers from three perspectives: architectural modification, pre-training, and applications.
CPT: A Pre-Trained Unbalanced Transformer for Both Chinese Language Understanding and Generation
  • Yunfan Shao, Zhichao Geng, +5 authors Xipeng Qiu
  • Computer Science
  • ArXiv
  • 2021
TLDR: The unbalanced Transformer reduces computational and storage costs, which makes CPT competitive and greatly accelerates inference for text generation.
Length-Adaptive Transformer: Train Once with Length Drop, Use Anytime with Search
TLDR: The proposed extension of PoWER-BERT enables training a large-scale transformer once and using it for various inference scenarios without re-training, and significantly extends the applicability of PoWER-BERT beyond sequence-level classification to token-level classification such as span-based question answering by introducing the idea of Drop-and-Restore.
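
The Drop-and-Restore idea lends itself to a short sketch (illustrative only; the norm-based importance score, the single drop point, and the `drop_and_restore` helper are assumptions rather than the paper's procedure): low-scoring tokens are set aside at an intermediate layer so that later layers process a shorter sequence, and they are re-inserted before token-level prediction.

```python
# Drop low-importance tokens mid-network, then restore them at the output so
# token-level predictions are still available for every position.
import torch

def drop_and_restore(layers_before, layers_after, hidden, keep_ratio=0.5):
    # hidden: (seq_len, dim) for a single sequence, for simplicity.
    for layer in layers_before:
        hidden = layer(hidden)
    scores = hidden.norm(dim=-1)                       # placeholder importance score
    k = max(1, int(keep_ratio * hidden.size(0)))
    keep_idx = scores.topk(k).indices.sort().values    # positions of tokens kept
    kept = hidden[keep_idx]
    for layer in layers_after:                         # later layers see fewer tokens
        kept = layer(kept)
    restored = hidden.clone()                          # dropped tokens keep old states
    restored[keep_idx] = kept                          # "restore" the processed tokens
    return restored                                    # (seq_len, dim) again

# Toy usage with identity "layers":
print(drop_and_restore([torch.nn.Identity()], [torch.nn.Identity()],
                       torch.randn(6, 8)).shape)       # torch.Size([6, 8])
```
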
Pre-trained Models for Natural Language Processing: A Survey
TLDR: This survey is intended as a hands-on guide for understanding, using, and developing PTMs for various NLP tasks.

References

Showing 1–10 of 44 references
Accelerating Pre-trained Language Models via Calibrated Cascade
TLDR: The working mechanism of dynamic early exiting is analyzed and it is found that it cannot achieve a satisfying trade-off between inference speed and performance, so CascadeBERT is proposed, which dynamically selects a proper-sized, complete model in a cascading manner.
The Right Tool for the Job: Matching Model and Instance Complexities
TLDR: This work proposes a modification to contextual representation fine-tuning which allows for an early (and fast) “exit” from neural network calculations for simple instances, and a late (and accurate) exit for hard instances during inference.
End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF
TLDR: A novel neural network architecture is introduced that benefits from both word- and character-level representations automatically, by using a combination of bidirectional LSTM, CNN, and CRF, making it applicable to a wide range of sequence labeling tasks.
DeeBERT: Dynamic Early Exiting for Accelerating BERT Inference
TLDR: This work proposes a simple but effective method, DeeBERT, to accelerate BERT inference, which allows samples to exit earlier without passing through the entire model, and provides new ideas to efficiently apply deep transformer-based models to downstream tasks.
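
For contrast with the token-level sketch near the top of this page, here is a minimal sample-level early-exit loop in the spirit of DeeBERT (the entropy criterion, the mean pooling, and the threshold are illustrative assumptions, not the released implementation): the whole input exits at the first intermediate classifier whose output distribution is confident enough.

```python
# Sample-level early exit: stop at the first classifier whose prediction
# entropy falls below a threshold; otherwise use the final classifier.
import torch

def early_exit_classify(layers, classifiers, hidden, max_entropy=0.3):
    for layer, classifier in zip(layers, classifiers):
        hidden = layer(hidden)
        probs = classifier(hidden.mean(dim=0)).softmax(dim=-1)   # pooled sequence repr.
        entropy = -(probs * probs.clamp_min(1e-9).log()).sum()
        if entropy <= max_entropy:          # confident enough: exit now
            return probs.argmax()
    return probs.argmax()                   # fell through to the last classifier

# Toy usage with identity "layers" and random classifier heads:
layers = [torch.nn.Identity() for _ in range(4)]
heads = [torch.nn.Linear(8, 3) for _ in range(4)]
print(int(early_exit_classify(layers, heads, torch.randn(5, 8))))
```
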
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
TLDR: A new language representation model, BERT, is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers; it can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter
TLDR: This work proposes a method to pre-train a smaller general-purpose language representation model, called DistilBERT, which can be fine-tuned with good performance on a wide range of tasks like its larger counterparts, and introduces a triple loss combining language modeling, distillation, and cosine-distance losses.
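
The triple loss mentioned in the summary can be written down compactly; the sketch below is an approximation (the temperature, equal loss weights, and the `distil_loss` signature are assumptions, not the released training code): a masked-language-modeling term, a soft-target distillation term, and a cosine term aligning student and teacher hidden states.

```python
# Sketch of a DistilBERT-style triple loss: MLM + soft-target distillation
# + cosine alignment of student and teacher hidden states.
import torch
import torch.nn.functional as F

def distil_loss(student_logits, teacher_logits, mlm_labels,
                student_hidden, teacher_hidden, temperature=2.0):
    mlm = F.cross_entropy(student_logits, mlm_labels, ignore_index=-100)
    distill = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    cosine = F.cosine_embedding_loss(
        student_hidden, teacher_hidden,
        torch.ones(student_hidden.size(0)),     # target = 1: align directions
    )
    return mlm + distill + cosine

# Toy usage with random tensors:
logits_s, logits_t = torch.randn(4, 100), torch.randn(4, 100)
hid_s, hid_t = torch.randn(4, 32), torch.randn(4, 32)
print(float(distil_loss(logits_s, logits_t, torch.randint(0, 100, (4,)), hid_s, hid_t)))
```
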
ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
TLDR: This work presents two parameter-reduction techniques to lower memory consumption and increase the training speed of BERT, and uses a self-supervised loss that focuses on modeling inter-sentence coherence.
Reducing Transformer Depth on Demand with Structured Dropout
TLDR: LayerDrop, a form of structured dropout, is explored; it has a regularization effect during training and allows for efficient pruning at inference time, showing that it is possible to select sub-networks of any depth from one large network without having to fine-tune them and with limited impact on performance.
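
A minimal sketch of LayerDrop-style structured dropout follows, under assumed details (the drop rate, the `LayerDropStack` wrapper, and the every-other-layer pruning example are illustrative): each layer is skipped with probability p during training, which is what later makes it cheap to prune whole layers at inference.

```python
# Structured dropout over layers: during training, skip each layer with
# probability p; at inference, entire layers can be pruned from the stack.
import torch
import torch.nn as nn

class LayerDropStack(nn.Module):
    def __init__(self, layers, p=0.2):
        super().__init__()
        self.layers = nn.ModuleList(layers)
        self.p = p

    def forward(self, x):
        for layer in self.layers:
            if self.training and torch.rand(()) < self.p:
                continue                      # skip the whole layer (structured drop)
            x = layer(x)
        return x

# At inference one can simply keep, e.g., every other layer:
stack = LayerDropStack([nn.Linear(8, 8) for _ in range(6)])
pruned = LayerDropStack(list(stack.layers)[::2], p=0.0)
print(pruned(torch.randn(4, 8)).shape)        # torch.Size([4, 8])
```
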
Attention is All you Need
TLDR: A new simple network architecture, the Transformer, based solely on attention mechanisms and dispensing with recurrence and convolutions entirely, is proposed; it generalizes well to other tasks, as demonstrated by successfully applying it to English constituency parsing with both large and limited training data.
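
The core operation behind the Transformer summary above is scaled dot-product attention, softmax(QK^T / sqrt(d_k)) V; a textbook sketch (not the paper's full multi-head module) is:

```python
# Scaled dot-product attention over query, key, and value tensors.
import math
import torch

def scaled_dot_product_attention(q, k, v):
    # q, k: (..., seq_len, d_k); v: (..., seq_len, d_v)
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    return scores.softmax(dim=-1) @ v

q = k = v = torch.randn(2, 5, 16)
print(scaled_dot_product_attention(q, k, v).shape)   # torch.Size([2, 5, 16])
```
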
Pre-Training with Whole Word Masking for Chinese BERT
TLDR: This technical report adapts whole word masking to Chinese text, masking the whole word instead of individual Chinese characters, which brings another challenge to the Masked Language Model (MLM) pre-training task.
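
A small sketch of whole-word masking over WordPiece output (the "##" continuation convention, the mask rate, and the `whole_word_mask` helper are assumptions; for Chinese, the word grouping would come from a word segmenter rather than "##" markers): when any sub-token of a word is selected for masking, all of its sub-tokens are masked together.

```python
# Whole-word masking: group sub-tokens into words, then mask entire words.
import random

def whole_word_mask(tokens, mask_rate=0.15, mask_token="[MASK]"):
    # "##"-prefixed pieces continue the previous word.
    words, current = [], []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and current:
            current.append(i)
        else:
            if current:
                words.append(current)
            current = [i]
    if current:
        words.append(current)
    masked = list(tokens)
    for word in words:
        if random.random() < mask_rate:
            for i in word:                     # mask every piece of the word
                masked[i] = mask_token
    return masked

print(whole_word_mask(["early", "##-", "##exit", "speeds", "up", "BERT"], mask_rate=1.0))
```
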