• Corpus ID: 208291331

Single Headed Attention RNN: Stop Thinking With Your Head

Stephen Merity
The leading approaches in language modeling are all obsessed with TV shows of my youth - namely Transformers and Sesame Street. Transformers this, Transformers that, and over here a bonfire worth of GPU-TPU-neuromorphic wafer scale silicon. We opt for the lazy path of old and proven techniques with a fancy crypto-inspired acronym: the Single Headed Attention RNN (SHA-RNN). The author's lone goal is to show that the entire field might have evolved in a different direction if we had instead been… 
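The mechanism in the title can be sketched in a few lines. Below is an illustrative NumPy rendering of a single scaled-dot-product attention head; this is not the author's code, and in the SHA-RNN the head sits alongside an LSTM and feed-forward layers that are omitted here.

```python
import numpy as np

def single_head_attention(query, keys, values):
    """One attention head: softmax(K·q / sqrt(d)) · V.

    Illustrative sketch only; the SHA-RNN pairs one such head
    with an LSTM backbone, not shown here.
    """
    d = query.shape[-1]
    scores = keys @ query / np.sqrt(d)      # similarity per position
    weights = np.exp(scores - scores.max()) # numerically stable softmax
    weights /= weights.sum()
    return weights @ values                 # weighted mix of values

# toy example: 4 timesteps, model dimension 8
rng = np.random.default_rng(0)
q = rng.standard_normal(8)
K = rng.standard_normal((4, 8))
V = rng.standard_normal((4, 8))
out = single_head_attention(q, K, V)
```

The point of "single headed" is simply that this computation happens once per layer, rather than being replicated across many heads.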


SHAQ: Single Headed Attention with Quasi-Recurrence
This work combines Stephen Merity's SHA-RNN with a new architecture called SHAQ: the Single Headed Attention Quasi-recurrent Neural Network, achieving accuracy similar to the SHA-RNN while delivering a 4x speedup in training.
WaLDORf: Wasteless Language-model Distillation On Reading-comprehension
A novel set of techniques are proposed which together produce a task-specific hybrid convolutional and transformer model, WaLDORf, that achieves state-of-the-art inference speed while still being more accurate than previous distilled models.
When Attention Meets Fast Recurrence: Training Language Models with Reduced Compute
This work presents SRU++, a highly efficient architecture that combines fast recurrence and attention for sequence modeling; it exhibits strong modeling capacity and training efficiency, suggesting that jointly leveraging fast recurrence with a small amount of attention is a promising direction for accelerating model training and inference.
Single headed attention based sequence-to-sequence model for state-of-the-art results on Switchboard-300
It is shown that state-of-the-art recognition performance can be achieved on the Switchboard-300 database using a single-headed attention, LSTM-based model.
SRU++: Pioneering Fast Recurrence with Attention for Speech Recognition
Analysis shows that SRU++ can surpass Conformer on long-form speech input by a large margin and generalizes well to such inputs.
Attention vs non-attention for a Shapley-based explanation method
Contextual Decomposition is extended to cover the operations necessary for attention-based models, providing an alternative Shapley-based attribution method for modern neural networks, and showing that the English and Dutch models demonstrate similar processing behaviour, but that under the hood there are consistent differences between attention and non-attention models.
Automated essay scoring using efficient transformer-based language models
This paper evaluates the performance of several fine-tuned pretrained NLP models with a modest number of parameters on an AES dataset and achieves excellent results with fewer parameters than most pretrained transformer-based models.
Mukayese: Turkish NLP Strikes Back
This paper presents Mukayese, a set of NLP benchmarks for the Turkish language that contains several NLP tasks and presents four new benchmarking datasets in Turkish for language modeling, sentence segmentation, and spell checking.
LegaLMFiT: Efficient Short Legal Text Classification with LSTM Language Model Pre-Training
This work shows that lightweight LSTM-based language models are able to capture enough information from a small legal-text pretraining corpus and achieve excellent performance on short legal text classification tasks, with significantly reduced computational overhead compared to BERT-based models.
Improving the Gating Mechanism of Recurrent Neural Networks
Two synergistic modifications to the standard gating mechanism, which are easy to implement, introduce no additional hyperparameters, and improve the learnability of gates close to saturation, robustly improve the performance of recurrent models on a range of applications.


Mogrifier LSTM
This work proposes an extension to the venerable Long Short-Term Memory in the form of mutual gating of the current input and the previous output, which affords the modelling of a richer space of interactions between inputs and their context.
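The "mutual gating" the Mogrifier summary describes can be sketched concisely: the input and previous hidden state alternately scale each other before the LSTM cell runs. The function name and the choice of five rounds below are illustrative assumptions, not the paper's reference code.

```python
import numpy as np

def mogrify(x, h, Q, R, rounds=5):
    """Mutual-gating sketch (assumed form): x and h alternately
    gate each other before the LSTM cell sees them.
    Odd rounds rescale x using h; even rounds rescale h using x."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    for i in range(1, rounds + 1):
        if i % 2 == 1:
            x = 2.0 * sigmoid(Q @ h) * x   # h modulates the input
        else:
            h = 2.0 * sigmoid(R @ x) * h   # x modulates the state
    return x, h

# toy example: dimension 4
rng = np.random.default_rng(1)
d = 4
x = rng.standard_normal(d)
h = rng.standard_normal(d)
Q = 0.1 * rng.standard_normal((d, d))
R = 0.1 * rng.standard_normal((d, d))
x2, h2 = mogrify(x, h, Q, R)
```

The factor of 2 keeps the expected scale of the gated vectors roughly unchanged, since a sigmoid averages about 0.5.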
Transformer-XL: Attentive Language Models beyond a Fixed-Length Context
This work proposes a novel neural architecture Transformer-XL that enables learning dependency beyond a fixed length without disrupting temporal coherence, which consists of a segment-level recurrence mechanism and a novel positional encoding scheme.
Regularizing and Optimizing LSTM Language Models
This paper proposes the weight-dropped LSTM which uses DropConnect on hidden-to-hidden weights as a form of recurrent regularization and introduces NT-ASGD, a variant of the averaged stochastic gradient method, wherein the averaging trigger is determined using a non-monotonic condition as opposed to being tuned by the user.
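The DropConnect idea in the weight-dropped LSTM summary is simple to illustrate: instead of dropping activations, individual hidden-to-hidden weights are zeroed, with one mask reused across all timesteps of a sequence. A minimal sketch, with an assumed helper name:

```python
import numpy as np

def drop_connect(weight, p, rng):
    """DropConnect sketch: zero each weight with probability p and
    rescale the survivors, so the expected weight is unchanged.
    Applied to the hidden-to-hidden matrix once per sequence."""
    mask = rng.random(weight.shape) >= p
    return weight * mask / (1.0 - p)

# toy example: drop half of a 4x4 recurrent weight matrix
rng = np.random.default_rng(0)
W = np.ones((4, 4))
W_dropped = drop_connect(W, 0.5, rng)
```

Because the same dropped matrix is used at every timestep, the recurrence stays consistent within a sequence, which is what distinguishes this from naive dropout on hidden states.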
Fast Transformer Decoding: One Write-Head is All You Need
This work proposes a variant called multi-query attention, where the keys and values are shared across all of the different attention "heads", greatly reducing the size of these tensors and hence the memory bandwidth requirements of incremental decoding.
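The multi-query variant described above is easy to picture in code: each head keeps its own queries, but all heads read from one shared key tensor and one shared value tensor. This is a hedged NumPy sketch with assumed names, not the paper's implementation:

```python
import numpy as np

def multi_query_attention(Q, K, V):
    """Multi-query attention sketch: per-head queries Q of shape
    (heads, n, d) attend over a single shared K and V of shape (n, d),
    shrinking the key/value tensors by a factor of `heads`."""
    h, n, d = Q.shape
    out = np.empty_like(Q)
    for i in range(h):
        scores = Q[i] @ K.T / np.sqrt(d)                      # (n, n)
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)                    # row softmax
        out[i] = w @ V                                        # shared values
    return out

# toy example: 2 heads, 5 positions, dimension 8
rng = np.random.default_rng(2)
Q = rng.standard_normal((2, 5, 8))
K = rng.standard_normal((5, 8))
V = rng.standard_normal((5, 8))
out = multi_query_attention(Q, K, V)
```

During incremental decoding the K/V cache dominates memory traffic, so sharing it across heads is where the bandwidth saving comes from.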
Character-Level Language Modeling with Deeper Self-Attention
This paper shows that a deep (64-layer) transformer model with fixed context outperforms RNN variants by a large margin, achieving state of the art on two popular benchmarks: 1.13 bits per character on text8 and 1.06 on enwik8.
Character-level language modeling with hierarchical recurrent neural networks
  • Kyuyeon Hwang, Wonyong Sung
  • Computer Science
    2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2017
This work proposes hierarchical RNN architectures, which consist of multiple modules with different timescales, and shows better perplexity than Kneser-Ney (KN) 5-gram WLMs on the One Billion Word Benchmark with only 2% of parameters.
Adaptive Attention Span in Transformers
We propose a novel self-attention mechanism that can learn its optimal attention span. This allows us to significantly extend the maximum context size used in the Transformer while maintaining control over memory footprint and computation time.
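Learning a span requires making its length differentiable, which the adaptive-span approach does with a soft mask that is 1 inside the span and ramps linearly to 0 just beyond it. A small sketch under that assumption (names and the ramp width are illustrative):

```python
import numpy as np

def soft_span_mask(distances, span, ramp=4.0):
    """Adaptive-span soft mask sketch: weight 1 for positions within
    the learned span, decaying linearly to 0 over `ramp` extra
    positions, so the span length admits a gradient."""
    return np.clip((ramp + span - distances) / ramp, 0.0, 1.0)

# distances of 0..9 tokens back, with a learned span of 5
distances = np.arange(10.0)
mask = soft_span_mask(distances, span=5.0)
```

Attention weights are multiplied by this mask (and renormalized), so positions far outside the span contribute nothing and can be skipped entirely at inference time.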
Generating Long Sequences with Sparse Transformers
This paper introduces sparse factorizations of the attention matrix which reduce its quadratic cost to $O(n \sqrt{n})$, generates unconditional samples that demonstrate global coherence and great diversity, and shows it is possible in principle to use self-attention to model sequences of length one million or more.
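One way to factorize the attention matrix along these lines is a strided pattern: each position attends to a short local window plus every stride-th earlier position. This sketch shows the resulting boolean connectivity mask; the function name is an assumption, and the paper defines several such patterns:

```python
import numpy as np

def strided_sparsity_mask(n, stride):
    """Causal strided sparsity sketch: position i attends to the
    previous `stride` positions and to every stride-th earlier
    position, giving roughly O(n*sqrt(n)) total entries when
    stride is about sqrt(n)."""
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        for j in range(i + 1):                       # causal: j <= i
            if i - j < stride or (i - j) % stride == 0:
                mask[i, j] = True
    return mask

# 16 positions with stride 4: local band plus strided columns
mask = strided_sparsity_mask(16, 4)
```

Any position can still reach any earlier position through at most two hops (local window, then a strided column), which is why global coherence survives the sparsification.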
Recurrent Highway Networks
A novel theoretical analysis of recurrent networks based on Geršgorin's circle theorem is introduced that illuminates several modeling and optimization issues and improves the understanding of the LSTM cell.
An Analysis of Neural Language Modeling at Multiple Scales
This work takes existing state-of-the-art word-level language models based on LSTMs and QRNNs and extends them to both larger vocabularies and character-level granularity, achieving state-of-the-art results on character-level and word-level datasets.