Simple Recurrence Improves Masked Language Models

Tao Lei, Ran Tian, Jasmijn Bastings, Ankur P. Parikh
In this work, we explore whether incorporating recurrence into the Transformer architecture can be both beneficial and efficient, by building an extremely simple recurrent module into the Transformer. We compare our model to baselines following the training and evaluation recipe of BERT. Our results confirm that recurrence can indeed improve Transformer models by a consistent margin, without requiring low-level performance optimizations, and while keeping the number of parameters constant. For…


Cramming: Training a Language Model on a Single GPU in One Day

This work investigates the downstream performance achievable with a transformer-based language model trained completely from scratch with masked language modeling for a single day on a single consumer GPU, examining why scaling down is hard and which modifications actually improve performance in this scenario.

Modeling Recurrence for Transformer

This work proposes to directly model recurrence for Transformer with an additional recurrence encoder, and introduces a novel attentive recurrent network to leverage the strengths of both attention models and recurrent networks.

Simple Recurrent Units for Highly Parallelizable Recurrence

The Simple Recurrent Unit (SRU) is proposed: a light recurrent unit that balances model capacity and scalability, designed to provide expressive recurrence and enable a highly parallelized implementation, and which comes with careful initialization to facilitate the training of deep models.
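The parallelism the summary refers to comes from batching all matrix multiplications over the full sequence up front, leaving only cheap elementwise operations in the sequential loop. A minimal NumPy sketch of the SRU recurrence, following the equations in the original paper (parameter names `W`, `Wf`, `Wr`, `vf`, `vr` here are illustrative, not the authors' code):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sru_forward(x, W, Wf, Wr, vf, vr, bf, br):
    """Forward pass of one SRU layer over a (T, d) input sequence.

    All matmuls are computed in parallel over time; the loop below
    contains only elementwise operations, which is what makes the
    recurrence fast on parallel hardware.
    """
    T, d = x.shape
    u = x @ W      # candidate values, batched over all timesteps
    uf = x @ Wf    # forget-gate pre-activations
    ur = x @ Wr    # reset-gate pre-activations

    c = np.zeros(d)
    h = np.empty((T, d))
    for t in range(T):
        f = sigmoid(uf[t] + vf * c + bf)   # forget gate
        c = f * c + (1.0 - f) * u[t]       # internal state update
        r = sigmoid(ur[t] + vr * c + br)   # reset gate
        h[t] = r * c + (1.0 - r) * x[t]    # highway-style output
    return h
```

In practice the loop is fused into a single elementwise CUDA kernel; the sketch above only shows the dataflow.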

When Attention Meets Fast Recurrence: Training Language Models with Reduced Compute

This work presents SRU++, a highly efficient architecture that combines fast recurrence and attention for sequence modeling, exhibits strong modeling capacity and training efficiency, and suggests jointly leveraging fast recurrence with little attention as a promising direction for accelerating model training and inference.

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.

RoBERTa: A Robustly Optimized BERT Pretraining Approach

It is found that BERT was significantly undertrained and, with improved pretraining, can match or exceed the performance of every model published after it; the best model achieves state-of-the-art results on GLUE, RACE, and SQuAD.

How to Train BERT with an Academic Budget

It is demonstrated that through a combination of software optimizations, design choices, and hyperparameter tuning, it is possible to produce models that are competitive with BERT-base on GLUE tasks at a fraction of the original pretraining cost.

Attention is All you Need

A new simple network architecture, the Transformer, based solely on attention mechanisms and dispensing with recurrence and convolutions entirely, is proposed; it generalizes well to other tasks, as demonstrated by applying it successfully to English constituency parsing with both large and limited training data.

TRANS-BLSTM: Transformer with Bidirectional LSTM for Language Understanding

TRANS-BLSTM is proposed as a joint modeling framework for the transformer and BLSTM, and it is shown that TRANS-BLSTM models consistently improve accuracy over BERT baselines in GLUE and SQuAD 1.1 experiments.

SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems

SuperGLUE, a new benchmark styled after GLUE, is presented, comprising a set of more difficult language understanding tasks, a software toolkit, and a public leaderboard.

Quasi-Recurrent Neural Networks

Quasi-recurrent neural networks (QRNNs), an approach to neural sequence modeling that alternates convolutional layers, which apply in parallel across timesteps, with a minimalist recurrent pooling function that applies in parallel across channels, are introduced.
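The split the summary describes — parallel convolutions over time, sequential but elementwise pooling over channels — can be sketched as follows. This is a minimal NumPy illustration of a QRNN layer with fo-pooling under the assumption of a causal filter of width 2; weight names `Wz`, `Wf`, `Wo` are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def qrnn_fo_pool(x, Wz, Wf, Wo):
    """QRNN layer with fo-pooling over a (T, d) input sequence.

    Wz, Wf, Wo each have shape (2, d, d): a causal width-2 convolution
    implemented as a shift plus two matmuls, computed in parallel over
    all timesteps. Only the elementwise pooling loop is sequential.
    """
    T, d = x.shape
    # Shifted copy so timestep t sees x[t] and x[t-1] (zero-padded).
    x_prev = np.vstack([np.zeros((1, d)), x[:-1]])
    z = np.tanh(x @ Wz[0] + x_prev @ Wz[1])    # candidate values
    f = sigmoid(x @ Wf[0] + x_prev @ Wf[1])    # forget gates
    o = sigmoid(x @ Wo[0] + x_prev @ Wo[1])    # output gates

    c = np.zeros(d)
    h = np.empty((T, d))
    for t in range(T):
        c = f[t] * c + (1.0 - f[t]) * z[t]     # recurrent pooling
        h[t] = o[t] * c                        # gated output
    return h
```

Because the pooling step has no matrix multiplications, it stays cheap even though it runs sequentially over timesteps.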