Corpus ID: 232035542

When Attention Meets Fast Recurrence: Training Language Models with Reduced Compute

@article{Lei2021WhenAM,
  title={When Attention Meets Fast Recurrence: Training Language Models with Reduced Compute},
  author={Tao Lei},
  journal={ArXiv},
  year={2021},
  volume={abs/2102.12459}
}
Large language models have become increasingly difficult to train because of the required computation time and cost. In this work, we present SRU++, a recurrent unit with optional built-in attention that exhibits state-of-the-art modeling capacity and training efficiency. On standard language modeling benchmarks such as the ENWIK8 and WIKI-103 datasets, our model obtains better perplexity and bits-per-character (bpc) while using 2.5x-10x less training time and cost compared to top-performing…
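The abstract describes SRU++ only at a high level: a recurrent unit with optional built-in attention. Below is a minimal, illustrative sketch of that general idea in PyTorch, not the paper's exact formulation. The class name, layer sizes, gating choices, residual connection, and the placement of attention before the elementwise recurrence are all assumptions made for illustration.

```python
# Illustrative sketch only: a recurrent layer whose per-step computation is
# optionally preceded by self-attention over the input sequence. The exact
# SRU++ equations are defined in the paper; this is not a faithful copy.
import torch
import torch.nn as nn


class AttentiveRecurrentLayer(nn.Module):
    def __init__(self, d_model: int, n_heads: int = 4, use_attention: bool = True):
        super().__init__()
        self.use_attention = use_attention
        if use_attention:
            # Optional built-in attention that mixes information across time.
            self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Single projection producing the candidate state and two gates.
        self.proj = nn.Linear(d_model, 3 * d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        if self.use_attention:
            attn_out, _ = self.attn(x, x, x, need_weights=False)
            x = x + attn_out  # residual connection (an assumption)
        u, f, r = self.proj(x).chunk(3, dim=-1)
        f, r = torch.sigmoid(f), torch.sigmoid(r)

        # Lightweight elementwise recurrence over time steps.
        c = torch.zeros_like(x[:, 0])
        outputs = []
        for t in range(x.size(1)):
            c = f[:, t] * c + (1.0 - f[:, t]) * u[:, t]
            outputs.append(r[:, t] * c + (1.0 - r[:, t]) * x[:, t])
        return torch.stack(outputs, dim=1)


# Usage on a toy batch: 2 sequences of length 16 with 64-dimensional inputs.
layer = AttentiveRecurrentLayer(d_model=64)
h = layer(torch.randn(2, 16, 64))  # -> (2, 16, 64)
```

The sketch reflects the efficiency argument in the abstract: attention is an optional, batched operation across the sequence, while the step-by-step recurrence is purely elementwise and therefore cheap.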

