When Attention Meets Fast Recurrence: Training Language Models with Reduced Compute

  • Tao Lei
  • Published in EMNLP, 24 February 2021
  • Computer Science
Large language models have become increasingly difficult to train because of the growing computation time and cost. In this work, we present SRU++, a highly efficient architecture that combines fast recurrence and attention for sequence modeling. SRU++ exhibits strong modeling capacity and training efficiency. On standard language modeling tasks such as the Enwik8, Wiki-103 and Billion Word datasets, our model obtains better bits-per-character and perplexity while using 3x-10x less training cost…
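The core idea of the fast recurrence in SRU/SRU++ is that the heavy matrix multiplications are batched over all timesteps up front, so the sequential loop contains only cheap element-wise operations. The sketch below illustrates that structure in NumPy; it is a simplified illustration of the recurrence pattern, not the official SRU++ implementation (the function name, gate layout, and initialization are assumptions for this example).

```python
import numpy as np

def sru_cell(x, W, Wf, Wr, bf, br):
    """Simplified SRU-style fast recurrence (an illustrative sketch,
    not the official SRU++ code).  All matmuls run in parallel over
    the whole sequence; the time loop is element-wise only."""
    T, d = x.shape
    # Heavy matmuls computed for all timesteps at once (parallel-friendly).
    u = x @ W                                 # candidate values
    f = 1 / (1 + np.exp(-(x @ Wf + bf)))      # forget gates
    r = 1 / (1 + np.exp(-(x @ Wr + br)))      # reset/highway gates
    c = np.zeros(d)
    h = np.empty_like(x)
    for t in range(T):                        # O(T * d), no matmul inside
        c = f[t] * c + (1 - f[t]) * u[t]      # fast recurrence
        h[t] = r[t] * np.tanh(c) + (1 - r[t]) * x[t]  # highway output
    return h
```

In SRU++, this cheap recurrence is interleaved with attention layers, which is where the reported modeling-capacity gains come from.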
Antibody-Antigen Docking and Design via Hierarchical Equivariant Refinement
A new model called the Hierarchical Equivariant Refinement Network (HERN) is proposed for paratope docking and design, which outperforms the prior state of the art on paratope docking and design benchmarks.
Confident Adaptive Language Modeling
This work introduces Confident Adaptive Language Modeling (CALM), a framework for dynamically allocating different amounts of compute per input and generation timestep, and demonstrates the efficacy of the framework in reducing compute while provably maintaining high performance.
A Waveform Mapping-Based Approach for Enhancement of Trunk Borers’ Vibration Signals Using Deep Learning Model
A boring-vibration enhancement model named VibDenoiser is constructed, which makes a significant contribution to this rarely studied domain and substantially increases the accuracy of several well-known classification models, enabling more practical larvae detection.
Simple Recurrence Improves Masked Language Models
It is found that recurrence can indeed improve Transformer models by a consistent margin, without requiring low-level performance optimizations, and while keeping the number of parameters constant.
Implicit n-grams Induced by Recurrence
This work presents a study showing that there exist explainable components residing within the hidden states, reminiscent of classical n-gram features, which could add interpretability to RNN architectures and also provide inspiration for new architectures for sequential data.
Block-Recurrent Transformers
We introduce the Block-Recurrent Transformer, which applies a transformer layer in a recurrent fashion along a sequence, and has linear complexity with respect to sequence length. Our recurrent cell…
Mukayese: Turkish NLP Strikes Back
This paper presents Mukayese, a set of NLP benchmarks for the Turkish language that contains several NLP tasks and presents four new benchmarking datasets in Turkish for language modeling, sentence segmentation, and spell checking.
Simple Local Attentions Remain Competitive for Long-Context Tasks
This work pretrains large models using the same long-document corpus and then finetunes them on real-world long-context tasks, revealing pitfalls of an existing widely used long-range benchmark and showing that none of the tested efficient attentions can beat a simple local window attention under standard pretraining paradigms.
SRU++: Pioneering Fast Recurrence with Attention for Speech Recognition
Analysis shows that SRU++ can surpass the Conformer by a large margin on long-form speech input and generalizes well to such inputs.
Iterative Refinement Graph Neural Network for Antibody Sequence-Structure Co-design
This paper proposes a generative model to automatically design the CDRs of antibodies with enhanced binding specificity or neutralization capabilities, and achieves superior log-likelihood on the test set and outperforms previous baselines in designing antibodies capable of neutralizing the SARS-CoV-2 virus.


Shortformer: Better Language Modeling using Shorter Inputs
This work identifies conditions where shorter inputs are not harmful, and achieves perplexity and efficiency improvements through two new methods that decrease input length, and shows how to improve the efficiency of recurrence methods in transformers.
The human knowledge compression contest
  • http://prize.hutter1.net/.
  • 2006
Autoregressive Knowledge Distillation through Imitation Learning
A compression technique for autoregressive models that is driven by an imitation learning perspective on knowledge distillation, and that consistently outperforms other distillation algorithms, such as sequence-level knowledge distillation.
Understanding the Difficulty of Training Transformers
It is revealed that, for each layer in a multi-layer Transformer model, heavy dependency on its residual branch makes training unstable, since it amplifies small parameter perturbations and results in significant disturbances in the model output, while a light dependency limits the potential of model training and can lead to an inferior trained model.
Single Headed Attention RNN: Stop Thinking With Your Head
This work has undergone no intensive hyperparameter optimization and lived entirely on a commodity desktop machine that made the author's small studio apartment far too warm in the midst of a San Franciscan summer.
On the Variance of the Adaptive Learning Rate and Beyond
This work identifies a problem of the adaptive learning rate, suggests warmup works as a variance reduction technique, and proposes RAdam, a new variant of Adam, by introducing a term to rectify the variance of the adaptive learning rate.
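The rectification term from the RAdam paper can be computed in closed form from the step count and the second-moment decay rate. The sketch below shows the formula only (a full optimizer would wrap this around Adam's moment updates); the function name is an assumption for this example.

```python
import math

def radam_rectification(t, beta2=0.999):
    """Variance rectification term r_t from the RAdam paper (formula
    sketch only, not a full optimizer).  For small t the variance of
    the adaptive learning rate is intractable, so RAdam falls back to
    an un-adapted momentum step; once rho_t > 4 it scales the adaptive
    step by r_t < 1, playing the same role as a warmup schedule."""
    rho_inf = 2.0 / (1.0 - beta2) - 1.0
    rho_t = rho_inf - 2.0 * t * beta2**t / (1.0 - beta2**t)
    if rho_t <= 4.0:
        return None  # variance intractable: use un-adapted step
    return math.sqrt(
        (rho_t - 4.0) * (rho_t - 2.0) * rho_inf
        / ((rho_inf - 4.0) * (rho_inf - 2.0) * rho_t)
    )
```

With the default beta2, the first few steps return None (no adaptive update), early steps give a small r_t, and r_t approaches 1 as training progresses, mimicking a warmup curve.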
Attention is All you Need
A new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely is proposed, which generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.
Pointer Sentinel Mixture Models
The pointer sentinel-LSTM model achieves state-of-the-art language modeling performance on the Penn Treebank while using far fewer parameters than a standard softmax LSTM; the freely available WikiText corpus is also introduced.
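The pointer sentinel mechanism combines two distributions with a learned gate: the standard softmax over the vocabulary and a pointer distribution over words in the recent context. The sketch below shows only the combination rule, under the assumption that the gate g (the mass assigned to the sentinel in the pointer's attention) and both distributions are already computed; the function name is hypothetical.

```python
import numpy as np

def pointer_sentinel_mix(p_vocab, p_ptr, g):
    """Combination rule of the Pointer Sentinel-LSTM (sketch only):
    the sentinel gate g interpolates between the softmax vocabulary
    distribution and the pointer distribution over context words.
    If both inputs are valid distributions, the mixture is too."""
    return g * p_vocab + (1 - g) * p_ptr
```

Because rare words often appear in the recent context, the pointer component lets the model copy them directly, which is how the mixture outperforms a softmax-only LSTM with fewer parameters.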
Optimizing Performance of Recurrent Neural Networks on GPUs
It is demonstrated that by exposing parallelism between operations within the network, an order of magnitude speedup across a range of network sizes can be achieved over a naive implementation.
Long Short-Term Memory
A novel, efficient, gradient based method called long short-term memory (LSTM) is introduced, which can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units.
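The "constant error carousel" is the additive cell-state update, through which gradients flow without repeated squashing. The sketch below shows one step of the now-common LSTM variant with forget and output gates (a later extension of the original 1997 formulation); the gate layout and function name are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def lstm_step(x, h, c, W, U, b):
    """One step of a standard LSTM cell (minimal sketch, using the
    common i/f/g/o gate stacking convention).  The additive update
    c_new = f*c + i*g is the constant error carousel: it lets error
    signals bridge long time lags without vanishing."""
    d = h.shape[0]
    z = W @ x + U @ h + b            # all four gate pre-activations at once
    i = sigmoid(z[0:d])              # input gate
    f = sigmoid(z[d:2*d])            # forget gate
    g = np.tanh(z[2*d:3*d])          # candidate cell update
    o = sigmoid(z[3*d:4*d])          # output gate
    c_new = f * c + i * g            # constant error carousel
    h_new = o * np.tanh(c_new)
    return h_new, c_new
```

Contrast this with the SRU-style cell above: the LSTM's gates depend on the previous hidden state h, so its matmuls cannot be hoisted out of the time loop, which is precisely the bottleneck fast-recurrence architectures remove.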