• Corpus ID: 231698881

RomeBERT: Robust Training of Multi-Exit BERT

Shijie Geng, Peng Gao, Zuohui Fu, Yongfeng Zhang
BERT has achieved strong performance on Natural Language Understanding (NLU) tasks. However, BERT has a large number of parameters and requires considerable resources to deploy. To accelerate inference, Dynamic Early Exiting for BERT (DeeBERT) was recently proposed; it adds multiple exits and uses a dynamic early-exit mechanism to make inference more efficient. While this achieves an efficiency-performance tradeoff, the performances of early exits in multi-exit BERT are significantly…
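As an illustration of the dynamic early-exit mechanism mentioned in the abstract, the following is a minimal sketch, not the authors' implementation. It assumes the DeeBERT-style rule of exiting at the first classifier whose output entropy falls below a threshold. The function names (`softmax`, `entropy`, `early_exit_predict`) and the threshold value are hypothetical, and the per-exit logits are passed in as plain lists rather than computed by a real model.

```python
import math
from typing import Iterable, List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one exit's logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (natural log) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def early_exit_predict(
    exit_logits: Iterable[Sequence[float]], threshold: float
) -> Tuple[int, int]:
    """Return (predicted class, index of the exit used).

    `exit_logits` yields logits from shallow to deep exits. Passing a
    generator lets deeper layers go uncomputed once an exit fires.
    Assumption: an exit fires when its entropy is below `threshold`;
    the deepest exit is used if none fires.
    """
    last = None
    for i, logits in enumerate(exit_logits):
        probs = softmax(logits)
        last = (probs.index(max(probs)), i)
        if entropy(probs) < threshold:
            return last
    if last is None:
        raise ValueError("at least one exit is required")
    return last


# Hypothetical logits for three exits of a binary classifier.
example_logits = [[0.1, 0.0], [5.0, -5.0], [0.0, 9.0]]
prediction, exit_index = early_exit_predict(example_logits, threshold=0.1)
```

In this sketch a lower threshold forces more samples through to deeper exits, which trades speed for accuracy; a threshold of 0 always falls through to the final exit.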

LeeBERT: Learned Early Exit for BERT with cross-level optimization
Proposes a novel training scheme, Learned Early Exit for BERT (LeeBERT), in which the exits learn from one another rather than only from the last layer. The training of LeeBERT is formulated as a bi-level optimization problem, and a novel cross-level optimization (CLO) algorithm is proposed to improve the optimization results.
A Survey on Green Deep Learning
This paper presents a systematic review of the development of Green deep learning technologies, and classifies these approaches into four categories: (1) compact networks, (2) energy-efficient training strategies, (3) energy-efficient inference approaches, and (4) efficient data usage.
Consistent Accelerated Inference via Confident Adaptive Transformers
This work presents Confident Adaptive Transformers (CATs), which increase computational efficiency while guaranteeing, with high confidence, a specifiable degree of consistency with the original model.
Scalable Transformers for Neural Machine Translation
A three-stage training scheme is proposed to tackle the difficulty of training the Scalable Transformers, which introduces additional supervisions from word-level and sequence-level self-distillation.
A Systematic Review of Machine Learning Algorithms in Cyberbullying Detection: Future Directions and Challenges
Muhammad Arif, Computer Science, Journal of Information Security and Cybercrimes Research, 2021
A systematic review of the current state-of-the-art research in this area is conducted, including various aspects of cyberbullying and its effect on the participating actors.


DeeBERT: Dynamic Early Exiting for Accelerating BERT Inference
This work proposes a simple but effective method, DeeBERT, to accelerate BERT inference, which allows samples to exit earlier without passing through the entire model, and provides new ideas to efficiently apply deep transformer-based models to downstream tasks.
FastBERT: a Self-distilling BERT with Adaptive Inference Time
A novel speed-tunable model, FastBERT, with adaptive inference time; depending on the speedup threshold chosen, it can run from 1 to 12 times faster than BERT, allowing a speed-performance tradeoff.
DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter
This work proposes a method to pre-train a smaller general-purpose language representation model, called DistilBERT, which can be fine-tuned with good performances on a wide range of tasks like its larger counterparts, and introduces a triple loss combining language modeling, distillation and cosine-distance losses.
BERT Loses Patience: Fast and Robust Inference with Early Exit
The proposed Patience-based Early Exit method couples an internal classifier with each layer of a PLM and dynamically stops inference when the intermediate predictions of the internal classifiers remain unchanged for a pre-defined number of steps, improving inference efficiency as well as accuracy and robustness.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
Born Again Neural Networks
This work studies knowledge distillation (KD) from a new perspective: rather than compressing models, it trains students parameterized identically to their teachers, and shows significant advantages from transferring knowledge between DenseNets and ResNets in either direction.
GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding
A benchmark of nine diverse NLU tasks, an auxiliary dataset for probing models for understanding of specific linguistic phenomena, and an online platform for evaluating and comparing models, which favors models that can represent linguistic knowledge in a way that facilitates sample-efficient learning and effective knowledge-transfer across tasks.
Attention is All you Need
Proposes a new, simple network architecture, the Transformer, based solely on attention mechanisms and dispensing with recurrence and convolutions entirely; it generalizes well to other tasks, as shown by applying it successfully to English constituency parsing with both large and limited training data.
Rethinking Attention with Performers
Introduces Performers, Transformer architectures that can estimate regular (softmax) full-rank-attention Transformers with provable accuracy using only linear space and time complexity, without relying on priors such as sparsity or low-rankness.
Deep Networks with Stochastic Depth
Proposes stochastic depth, a training procedure that enables the seemingly contradictory setup of training short networks and using deep networks at test time; it reduces training time substantially and improves test error significantly on almost all data sets used for evaluation.
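
The stopping rule summarized in the entry for "BERT Loses Patience: Fast and Robust Inference with Early Exit" above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the paper's code: the function name `patience_exit` is hypothetical, the per-layer predictions are supplied as a plain sequence of class ids, and "unchanged for a pre-defined number of steps" is interpreted here as `patience` consecutive layers agreeing with the layer before them.

```python
from typing import Iterable, Tuple


def patience_exit(predictions: Iterable[int], patience: int) -> Tuple[int, int]:
    """Return (predicted class, index of the layer where inference stops).

    `predictions` yields the class predicted by each internal classifier,
    from the shallowest layer to the deepest. Inference stops once the
    prediction has stayed unchanged for `patience` consecutive steps
    (an assumed counting convention); otherwise the deepest layer's
    prediction is returned.
    """
    if patience < 1:
        raise ValueError("patience must be at least 1")
    streak = 0       # consecutive layers that agreed with the previous layer
    previous = None  # prediction of the previous layer
    result = None
    for layer, pred in enumerate(predictions):
        if layer > 0 and pred == previous:
            streak += 1
        else:
            streak = 0
        previous = pred
        result = (pred, layer)
        if streak >= patience:
            return result
    if result is None:
        raise ValueError("at least one layer prediction is required")
    return result


# Hypothetical predictions from five internal classifiers.
label, stop_layer = patience_exit([1, 0, 0, 0, 1], patience=2)
```

Unlike an entropy threshold, this rule depends only on agreement between successive classifiers, so it needs no calibration of confidence scores; a larger `patience` value delays the exit.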