DACT-BERT: Differentiable Adaptive Computation Time for an Efficient BERT Inference

Cristobal Eyzaguirre, Felipe del Río, Vladimir Araujo, Álvaro Soto
Large-scale pre-trained language models have shown remarkable results in diverse NLP applications. However, these performance gains have been accompanied by a significant increase in computation time and model size, stressing the need to develop new or complementary strategies to increase the efficiency of these models. This paper proposes DACT-BERT, a differentiable adaptive computation time strategy for BERT-like models. DACT-BERT adds an adaptive computational mechanism to BERT’s regular… 

AdapLeR: Speeding up Inference by Adaptive Length Reduction

This work proposes a novel approach for reducing the computational cost of BERT with minimal loss in downstream performance: it dynamically eliminates less-contributing tokens as the input passes through the layers, resulting in shorter sequence lengths and consequently lower computational cost.
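
To make the idea of shrinking the sequence layer by layer concrete, here is a minimal sketch in plain Python. It is not AdapLeR's actual procedure: the summary above does not say how token contributions are estimated, so the per-layer scoring functions, the single fixed `threshold`, and the rule of always keeping the first token are all assumptions made for illustration, and the names `drop_low_contribution` and `lengths_through_layers` are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical type: maps the tokens still present to one score per token.
ScoreFn = Callable[[Sequence[str]], Sequence[float]]


def drop_low_contribution(
    tokens: Sequence[str], scores: Sequence[float], threshold: float
) -> List[str]:
    """Keep tokens whose score reaches the threshold.

    Assumption: position 0 (for example a [CLS]-style token) is always
    kept so that the sequence can never become empty.
    """
    if len(tokens) != len(scores):
        raise ValueError("need exactly one score per token")
    return [
        tok
        for i, (tok, score) in enumerate(zip(tokens, scores))
        if i == 0 or score >= threshold
    ]


def lengths_through_layers(
    tokens: Sequence[str], layer_scorers: Sequence[ScoreFn], threshold: float
) -> Tuple[List[str], List[int]]:
    """Apply one pruning step per layer and record the sequence length."""
    current = list(tokens)
    lengths = [len(current)]
    for scorer in layer_scorers:
        current = drop_low_contribution(current, scorer(current), threshold)
        lengths.append(len(current))
    return current, lengths


# Toy scorers standing in for whatever contribution estimate a model uses.
toy_scorers: List[ScoreFn] = [
    lambda toks: [len(t) / 10 for t in toks],
    lambda toks: [1.0 if t == "great" else 0.0 for t in toks],
]
kept, lengths = lengths_through_layers(
    ["[CLS]", "the", "movie", "was", "great"], toy_scorers, threshold=0.4
)
```

Because later layers only see the tokens that survived earlier ones, the recorded lengths never increase, which is where the saving in computation would come from.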

Entropy-based Stability-Plasticity for Lifelong Learning

Entropy-based Stability-Plasticity (ESP) is proposed, which dynamically decides how much each model layer should be modified via a plasticity factor, and incorporates branch layers and an entropy-based criterion into the model to find this factor.



FastBERT: a Self-distilling BERT with Adaptive Inference Time

A novel speed-tunable FastBERT with adaptive inference time that can run from 1 to 12 times faster than BERT, depending on the speedup threshold chosen to trade off speed against performance.

DeeBERT: Dynamic Early Exiting for Accelerating BERT Inference

This work proposes a simple but effective method, DeeBERT, to accelerate BERT inference, which allows samples to exit earlier without passing through the entire model, and provides new ideas to efficiently apply deep transformer-based models to downstream tasks.

Differentiable Adaptive Computation Time for Visual Reasoning

This paper presents a novel attention-based algorithm for achieving adaptive computation called DACT, which, unlike existing ones, is end-to-end differentiable and presents adaptive computation as equivalent to an ensemble of models, similar to a mixture-of-experts formulation.

ALBERT: A Lite BERT for Self-supervised Learning of Language Representations

This work presents two parameter-reduction techniques to lower memory consumption and increase the training speed of BERT, and uses a self-supervised loss that focuses on modeling inter-sentence coherence.

DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter

This work proposes a method to pre-train a smaller general-purpose language representation model, called DistilBERT, which can be fine-tuned with good performance on a wide range of tasks like its larger counterparts, and introduces a triple loss combining language modeling, distillation and cosine-distance losses.

TinyBERT: Distilling BERT for Natural Language Understanding

A novel Transformer distillation method specially designed for knowledge distillation (KD) of Transformer-based models is proposed; by leveraging this new KD method, the rich knowledge encoded in a large “teacher” BERT can be effectively transferred to a small “student” TinyBERT.

Adaptive Computation Time for Recurrent Neural Networks

Performance is dramatically improved and insight is provided into the structure of the data, with more computation allocated to harder-to-predict transitions, such as spaces between words and ends of sentences, which suggests that ACT or other adaptive computation methods could provide a generic method for inferring segment boundaries in sequence data.

BERxiT: Early Exiting for BERT with Better Fine-Tuning and Extension to Regression

This paper proposes a more advanced fine-tuning strategy and a learning-to-exit module that extends early exiting to tasks other than classification, and demonstrates improved early exiting for BERT.

BERT Loses Patience: Fast and Robust Inference with Early Exit

The proposed Patience-based Early Exit method couples an internal classifier with each layer of a PLM and dynamically stops inference when the intermediate predictions of the internal classifiers remain unchanged for a pre-defined number of steps, which improves inference efficiency as well as accuracy and robustness.
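
The stopping rule described above can be sketched in a few lines of plain Python. This is one reading of the summary rather than the paper's implementation: the function name `patience_early_exit` is hypothetical, the internal classifiers are replaced by an iterable of already-decided labels, and "unchanged for a pre-defined number of steps" is interpreted as `patience` consecutive layers repeating the previous layer's label.

```python
from typing import Iterable, Optional, Tuple


def patience_early_exit(
    layer_predictions: Iterable[int], patience: int
) -> Tuple[int, int]:
    """Return (label, layers_used) under a patience-based stopping rule.

    layer_predictions yields, layer by layer, the label chosen by that
    layer's internal classifier (a stand-in for running the real model).
    The loop stops pulling from the iterable as soon as the label has
    stayed the same for `patience` consecutive layers.
    """
    if patience < 1:
        raise ValueError("patience must be at least 1")

    streak = 0                      # consecutive layers with an unchanged label
    previous: Optional[int] = None  # label from the layer before this one
    depth = 0
    for depth, label in enumerate(layer_predictions, start=1):
        if depth > 1 and label == previous:
            streak += 1
        else:
            streak = 0              # first layer, or the prediction changed
        previous = label
        if streak >= patience:
            break                   # exit early; deeper layers are never run
    if depth == 0 or previous is None:
        raise ValueError("need at least one layer prediction")
    return previous, depth


# Usage: six layers whose predictions settle on label 1 from the third layer on.
label, layers_used = patience_early_exit([2, 0, 1, 1, 1, 1], patience=2)
```

Passing a generator instead of a list keeps the sketch lazy, so predictions for layers beyond the exit point are never requested. If the label never stabilizes, every layer is used and the last layer's label is returned.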

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.