Embedding Recycling for Language Models

Jon Saad-Falcon, Amanpreet Singh, Luca Soldaini, Mike D'Arcy, Arman Cohan, Doug Downey
Training and inference with large neural models is expensive. However, for many application domains, while new tasks and models arise frequently, the underlying documents being modeled remain mostly unchanged. We study how to decrease computational cost in such settings through embedding recycling (ER): re-using activations from previous model runs when performing training or inference. In contrast to prior work focusing on freezing small classification heads for finetuning, which often leads to…




General Purpose Text Embeddings from Pre-trained Language Models for Scalable Inference

This work aims to reduce inference cost in a setting where many different predictions are made on a single piece of text, and shows that binary quantization can reduce the size of the extracted representations by a factor of 16, making them cheap to store for later use.
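The quantization step can be sketched as sign-bit packing: keeping one bit per float32 dimension. This is a simplified illustration of the idea, not the paper's implementation; the function names are hypothetical.

```python
def binarize(embedding):
    # Pack the sign of each dimension into an integer bit string:
    # one bit per float32 dimension instead of 32 bits.
    bits = 0
    for i, x in enumerate(embedding):
        if x > 0:
            bits |= 1 << i
    return bits

def hamming_sim(a, b, dim):
    # Similarity between two binarized embeddings:
    # the fraction of sign bits on which they agree.
    return 1 - bin((a ^ b) & ((1 << dim) - 1)).count("1") / dim

e1 = [0.3, -0.2, 0.8, -0.1]
e2 = [0.1, -0.5, 0.7, 0.2]
b1, b2 = binarize(e1), binarize(e2)
assert hamming_sim(b1, b2, 4) == 0.75  # 3 of 4 sign bits agree
```

Raw sign-bit packing gives a 32x reduction over float32; the factor of 16 reported above presumably reflects end-to-end storage overheads.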

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.

Accelerating Deep Learning Inference via Freezing

It is observed that caching intermediate layer outputs can help avoid running all the layers of a DNN for a sizeable fraction of inference requests; a system is presented that introduces approximate caching at each intermediate layer, along with techniques to reduce the cache size and improve the cache hit rate.

DeeBERT: Dynamic Early Exiting for Accelerating BERT Inference

This work proposes a simple but effective method, DeeBERT, to accelerate BERT inference, which allows samples to exit earlier without passing through the entire model, and provides new ideas to efficiently apply deep transformer-based models to downstream tasks.
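The early-exit mechanism can be sketched as follows: an "off-ramp" classifier at each layer lets confident examples leave the network early. This is a hedged illustration in the DeeBERT style, not its actual code; the probability lists stand in for per-layer classifier outputs.

```python
import math

def entropy(probs):
    # Prediction entropy: low entropy means a confident classifier.
    return -sum(p * math.log(p) for p in probs if p > 0)

def early_exit_predict(per_layer_probs, threshold=0.4):
    depth, probs = 0, per_layer_probs[-1]
    for depth, probs in enumerate(per_layer_probs, start=1):
        if entropy(probs) < threshold:   # confident enough: exit here
            break
    return probs.index(max(probs)), depth

# Layer 1 is uncertain; layer 2 is confident, so layer 3 is never run.
label, depth = early_exit_predict([[0.5, 0.5], [0.9, 0.1], [0.99, 0.01]])
assert (label, depth) == (0, 2)
```

The entropy threshold trades accuracy for speed: a looser threshold exits more samples earlier at the cost of occasionally committing to a wrong off-ramp prediction.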

What Would Elsa Do? Freezing Layers During Transformer Fine-Tuning

This paper examines two recent pretrained language models, BERT and RoBERTa, across standard tasks in textual entailment, semantic similarity, sentiment analysis, and linguistic acceptability, and shows that only a fourth of the final layers need to be fine-tuned to achieve 90% of the original quality.
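The finding above, that tuning only the final fourth of the layers recovers most of the quality, amounts to a simple layer-selection rule. A minimal sketch (illustrative names, not the authors' code):

```python
NUM_LAYERS = 12  # e.g. BERT-base

def trainable_layer_ids(num_layers, tune_fraction=0.25):
    # Freeze the bottom (1 - tune_fraction) of the encoder and
    # mark only the top layers as trainable.
    cutoff = int(num_layers * (1 - tune_fraction))
    return [i for i in range(num_layers) if i >= cutoff]

# Only layers 9-11 of a 12-layer encoder receive gradient updates.
assert trainable_layer_ids(NUM_LAYERS) == [9, 10, 11]
```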

Delta Tuning: A Comprehensive Study of Parameter Efficient Methods for Pre-trained Language Models

Though delta tuning was initially proposed as an efficient method to steer large models, some of the fascinating evidence discovered along the way could help further reveal the mechanisms of PLMs and even deep neural networks.

AutoFreeze: Automatically Freezing Model Blocks to Accelerate Fine-tuning

This work proposes AutoFreeze, a system that uses an adaptive approach to choose which layers are trained, shows how this can accelerate model fine-tuning while preserving accuracy, and develops mechanisms for efficient caching of intermediate activations to reduce forward computation time during fine-tuning.

Attention is All you Need

A new simple network architecture, the Transformer, based solely on attention mechanisms and dispensing with recurrence and convolutions entirely, is proposed; it generalizes well to other tasks, as shown by applying it successfully to English constituency parsing with both large and limited training data.

Parameter-Efficient Transfer Learning for NLP

To demonstrate the adapters' effectiveness, the recently proposed BERT Transformer model is transferred to 26 diverse text classification tasks, including the GLUE benchmark; adapters attain near state-of-the-art performance while adding only a few parameters per task.
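An adapter is a small bottleneck module (down-projection, nonlinearity, up-projection) with a residual connection, inserted into an otherwise frozen transformer layer. The pure-Python sketch below is a hedged stand-in with hypothetical names; real adapters use learned weight matrices.

```python
def matvec(matrix, vec):
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

def relu(vec):
    return [max(0.0, x) for x in vec]

def adapter(hidden, w_down, w_up):
    # Bottleneck transform plus residual: the output keeps the hidden size,
    # but only the tiny w_down / w_up matrices are trained per task.
    bottleneck = relu(matvec(w_down, hidden))
    return [h + u for h, u in zip(hidden, matvec(w_up, bottleneck))]

hidden = [1.0, 2.0, 3.0, 4.0]          # hidden size 4
w_down = [[0.5, 0.0, 0.0, 0.0]]        # 4 -> 1 bottleneck projection
w_up = [[0.1], [0.1], [0.1], [0.1]]    # 1 -> 4 up-projection
out = adapter(hidden, w_down, w_up)
assert [round(x, 6) for x in out] == [1.05, 2.05, 3.05, 4.05]
```

Because the bottleneck dimension is far smaller than the hidden size, each new task adds only a few parameters while the shared backbone stays frozen.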

Universal Language Model Fine-tuning for Text Classification

This work proposes Universal Language Model Fine-tuning (ULMFiT), an effective transfer learning method that can be applied to any task in NLP, and introduces techniques that are key for fine-tuning a language model.