Corpus ID: 238408117

Language Modeling using LMUs: 10x Better Data Efficiency or Improved Scaling Compared to Transformers

Narsimha Chilkuri, Eric Hunsberger, Aaron R. Voelker, Gurshaant Singh Malik, Chris Eliasmith
Recent studies have demonstrated that the performance of transformers on the task of language modeling obeys a power-law relationship with model size over six orders of magnitude. While transformers exhibit impressive scaling, their performance hinges on processing large amounts of data, and their computational and memory requirements grow quadratically with sequence length. Motivated by these considerations, we construct a Legendre Memory Unit based model that introduces a general prior for… 
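The "general prior" of an LMU comes from a fixed linear time-invariant (LTI) memory system whose state optimally compresses a sliding window of the input onto the Legendre polynomial basis. Below is a minimal numpy sketch of the continuous-time (A, B) matrices as defined in Voelker et al. (2019), paired with a simple Euler discretization for illustration (the papers use zero-order-hold discretization in practice); the function names are illustrative, not from the paper's code.

```python
import numpy as np

def lmu_matrices(d, theta):
    # Continuous-time LTI system m'(t) = (A/theta) m(t) + (B/theta) u(t),
    # whose state m approximates a window of length theta of the input u
    # projected onto the first d Legendre polynomials.
    A = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            A[i, j] = (2 * i + 1) * (-1.0 if i < j else (-1.0) ** (i - j + 1))
    q = np.arange(d)
    B = ((2 * q + 1) * (-1.0) ** q).reshape(d, 1)
    return A / theta, B / theta

def lmu_memory(u, d=8, theta=32.0, dt=1.0):
    # Run the memory recurrence over a 1-D input sequence u.
    # Euler discretization is a simplification used here for clarity.
    A, B = lmu_matrices(d, theta)
    Ad = np.eye(d) + dt * A
    Bd = dt * B
    m = np.zeros((d, 1))
    states = []
    for ut in u:
        m = Ad @ m + Bd * ut
        states.append(m.copy())
    return np.stack(states)
```

Because (A, B) are fixed rather than learned, the memory dynamics act as an architectural prior on how past inputs are represented, in contrast to attention, whose cost over the full sequence grows quadratically.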


The spike gating flow: A hierarchical structure-based spiking neural network for online gesture recognition

This work develops a hierarchical structure-based spiking neural network for online gesture recognition whose few-shot learning (FSL) paradigm combines 1) a network design that incorporates prior human knowledge with 2) SNNs for content-based global dynamic feature detection.

Big Bird: Transformers for Longer Sequences

It is shown that BigBird is a universal approximator of sequence functions and is Turing complete, thereby preserving these properties of the quadratic, full attention model.

Parallelizing Legendre Memory Unit Training

The linear time-invariant (LTI) memory component of the LMU is leveraged to construct a simplified variant that can be parallelized during training (and yet executed as an RNN during inference), thus overcoming a well known limitation of training RNNs on GPUs.
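The parallelization trick above rests on a standard property of LTI systems: the recurrence m_t = Ad m_{t-1} + Bd u_t can be unrolled into a convolution of the input with the system's impulse response H[t] = Ad^t Bd, so all timesteps can be computed at once during training while the same system still runs as an RNN at inference. A minimal numpy sketch under that assumption (the paper computes this efficiently, e.g. with FFTs; this naive version is only for clarity, and the function names are illustrative):

```python
import numpy as np

def impulse_response(Ad, Bd, T):
    # H[t] = Ad^t @ Bd, the state reached t steps after a unit input impulse.
    d = Ad.shape[0]
    H = np.zeros((T, d))
    h = Bd.copy()
    for t in range(T):
        H[t] = h[:, 0]
        h = Ad @ h
    return H

def parallel_lti(u, Ad, Bd):
    # Compute every state of the recurrence m_t = Ad m_{t-1} + Bd u_t
    # without stepping sequentially: m_t = sum_{k<=t} H[t-k] u[k].
    T = len(u)
    H = impulse_response(Ad, Bd, T)
    d = Ad.shape[0]
    M = np.zeros((T, d))
    for t in range(T):
        # In practice this causal convolution is batched across all t
        # (e.g. via FFT), which is what makes GPU training fast.
        M[t] = sum(H[t - k] * u[k] for k in range(t + 1))
    return M
```

Each output row depends only on the input and the precomputed impulse response, so the per-timestep data dependency of an RNN disappears during training.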

Lite Transformer with Long-Short Range Attention

This paper investigates the mobile setting for NLP tasks to facilitate the deployment on the edge devices and designs Lite Transformer, which demonstrates consistent improvement over the transformer on three well-established language tasks: machine translation, abstractive summarization, and language modeling.

Language Models are Unsupervised Multitask Learners

It is demonstrated that language models begin to learn these tasks without any explicit supervision when trained on a new dataset of millions of webpages called WebText, suggesting a promising path towards building language processing systems which learn to perform tasks from their naturally occurring demonstrations.

Attention is All you Need

This work proposes the Transformer, a simple network architecture based solely on attention mechanisms, dispensing with recurrence and convolutions entirely; it also generalizes well to other tasks, applied successfully to English constituency parsing with both large and limited training data.
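The core operation of the Transformer is scaled dot-product attention, softmax(QKᵀ/√d_k)V, which is also the source of the quadratic sequence-length cost mentioned in the abstract above: the score matrix has one entry per pair of positions. A minimal single-head numpy sketch (no masking or batching):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Q: (T_q, d_k), K: (T_k, d_k), V: (T_k, d_v)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)           # (T_q, T_k): quadratic in length
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                        # weighted average of values
```

With Q = 0 the softmax is uniform, so every output row is simply the mean of the value vectors, which is a convenient sanity check.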

The Pile: An 800GB Dataset of Diverse Text for Language Modeling

This work presents the Pile, an 825 GiB English text corpus targeted at training large-scale language models, constructed from 22 diverse high-quality subsets—both existing and newly constructed—many of which derive from academic or professional sources.

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

This systematic study compares pre-training objectives, architectures, unlabeled datasets, transfer approaches, and other factors on dozens of language understanding tasks and achieves state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more.

Conformer: Convolution-augmented Transformer for Speech Recognition

This work proposes the convolution-augmented transformer for speech recognition, named Conformer, which significantly outperforms previous Transformer- and CNN-based models, achieving state-of-the-art accuracy.

Improving Language Understanding by Generative Pre-Training

The general task-agnostic model outperforms discriminatively trained models that use architectures specifically crafted for each task, improving upon the state of the art in 9 out of the 12 tasks studied.

Legendre Memory Units: Continuous-Time Representation in Recurrent Neural Networks

Backpropagation through the ODE solver allows each layer to adapt its internal time-step, enabling the network to learn task-relevant time-scales and exceed state-of-the-art performance among RNNs on permuted sequential MNIST.