Corpus ID: 238408117

Language Modeling using LMUs: 10x Better Data Efficiency or Improved Scaling Compared to Transformers

@article{Chilkuri2021LanguageMU,
  title={Language Modeling using LMUs: 10x Better Data Efficiency or Improved Scaling Compared to Transformers},
  author={Narsimha Chilkuri and Eric Hunsberger and Aaron R. Voelker and Gurshaant Singh Malik and Chris Eliasmith},
  journal={ArXiv},
  year={2021},
  volume={abs/2110.02402}
}
Recent studies have demonstrated that the performance of transformers on the task of language modeling obeys a power-law relationship with model size over six orders of magnitude. While transformers exhibit impressive scaling, their performance hinges on processing large amounts of data, and their computational and memory requirements grow quadratically with sequence length. Motivated by these considerations, we construct a Legendre Memory Unit based model that introduces a general prior for… 
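For background, the Legendre Memory Unit's memory cell is a fixed (non-learned) linear time-invariant system, theta * dx/dt = A x + B u, whose A and B matrices are derived from Legendre polynomials (Voelker et al., 2019). The NumPy sketch below is an illustration rather than the authors' implementation: it constructs those matrices and discretizes them with a zero-order hold; the function names and the order/theta values are placeholders.

```python
import numpy as np
from scipy.linalg import expm

def lmu_matrices(order, theta):
    """State-space matrices of the LMU memory: theta * dx/dt = A x + B u.

    x holds `order` Legendre coefficients that approximate a sliding
    window over the last `theta` time units of the input u.
    """
    A = np.zeros((order, order))
    B = np.zeros((order, 1))
    for i in range(order):
        B[i, 0] = (2 * i + 1) * (-1) ** i
        for j in range(order):
            A[i, j] = (2 * i + 1) * (-1 if i < j else (-1) ** (i - j + 1))
    return A / theta, B / theta

def discretize_zoh(A, B, dt=1.0):
    """Zero-order-hold discretization: x[t] = Ad @ x[t-1] + Bd * u[t]."""
    Ad = expm(A * dt)
    Bd = np.linalg.solve(A, (Ad - np.eye(len(A))) @ B)
    return Ad, Bd

# Placeholder hyperparameters, not values from the paper.
A, B = lmu_matrices(order=16, theta=100.0)
Ad, Bd = discretize_zoh(A, B, dt=1.0)
```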

Citations

The spike gating flow: A hierarchical structure-based spiking neural network for online gesture recognition

The developed network's few-shot learning (FSL) paradigm combines: 1) a hierarchical, structure-based network design that incorporates prior human knowledge; and 2) SNNs for content-based global dynamic feature detection.

References

Showing 1-10 of 13 references

Big Bird: Transformers for Longer Sequences

It is shown that BigBird is a universal approximator of sequence functions and is Turing complete, thereby preserving these properties of the quadratic, full attention model.

Parallelizing Legendre Memory Unit Training

The linear time-invariant (LTI) memory component of the LMU is leveraged to construct a simplified variant that can be parallelized during training (and yet executed as an RNN during inference), thus overcoming a well known limitation of training RNNs on GPUs.
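To make the parallelization idea concrete: because the memory update x[t] = Ad x[t-1] + Bd u[t] is linear and time-invariant, it unrolls into a convolution of the input with the system's impulse response H[k] = Ad^k Bd, which can be computed for all time steps at once. The NumPy check below uses stand-in matrices and illustrative names, not the authors' code.

```python
import numpy as np

def recurrent_scan(Ad, Bd, u):
    """Sequential RNN-style update: x[t] = Ad @ x[t-1] + Bd * u[t]."""
    x = np.zeros(Ad.shape[0])
    xs = []
    for u_t in u:
        x = Ad @ x + Bd[:, 0] * u_t
        xs.append(x)
    return np.stack(xs)                               # (T, d)

def parallel_conv(Ad, Bd, u):
    """Same states written as a convolution with the impulse response
    H[k] = Ad^k @ Bd.  The explicit loop is only for clarity; in practice
    this form is evaluated in parallel (batched matmuls or FFTs)."""
    T, d = len(u), Ad.shape[0]
    H = np.stack([np.linalg.matrix_power(Ad, k) @ Bd[:, 0] for k in range(T)])
    xs = np.zeros((T, d))
    for t in range(T):
        xs[t] = (H[: t + 1][::-1] * u[: t + 1, None]).sum(axis=0)
    return xs

rng = np.random.default_rng(0)
Ad = rng.normal(size=(4, 4)) * 0.2                    # stand-in for the discretized LMU matrices
Bd = rng.normal(size=(4, 1))
u = rng.normal(size=64)
assert np.allclose(recurrent_scan(Ad, Bd, u), parallel_conv(Ad, Bd, u))
```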

Lite Transformer with Long-Short Range Attention

This paper investigates the mobile setting for NLP tasks to facilitate deployment on edge devices and designs Lite Transformer, which demonstrates consistent improvements over the transformer on three well-established language tasks: machine translation, abstractive summarization, and language modeling.

Language Models are Unsupervised Multitask Learners

It is demonstrated that language models begin to learn a range of NLP tasks without any explicit supervision when trained on a new dataset of millions of webpages called WebText, suggesting a promising path towards building language processing systems that learn to perform tasks from their naturally occurring demonstrations.

Attention is All you Need

A new, simple network architecture, the Transformer, based solely on attention mechanisms and dispensing with recurrence and convolutions entirely, is proposed; it generalizes well to other tasks, as shown by its successful application to English constituency parsing with both large and limited training data.
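The quadratic cost mentioned in the abstract above comes from this attention mechanism: every query is scored against every key, producing an (n, n) matrix for a length-n sequence. A minimal single-head, unmasked NumPy sketch of scaled dot-product attention (illustrative only):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V -- the (n, n) score matrix is what makes
    time and memory grow quadratically with sequence length n."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # (n, n)
    scores -= scores.max(axis=-1, keepdims=True)      # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                                # (n, d_v)

n, d = 128, 64
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))
out = scaled_dot_product_attention(Q, K, V)           # cost and memory ~ O(n^2)
```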

The Pile: An 800GB Dataset of Diverse Text for Language Modeling

This work presents the Pile, an 825 GiB English text corpus targeted at training large-scale language models, constructed from 22 diverse, high-quality subsets (both existing and newly constructed), many of which derive from academic or professional sources.

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

This systematic study compares pre-training objectives, architectures, unlabeled datasets, transfer approaches, and other factors on dozens of language understanding tasks and achieves state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more.

Improving Language Understanding by Generative Pre-Training

The general task-agnostic model outperforms discriminatively trained models that use architectures specifically crafted for each task, improving upon the state of the art in 9 out of the 12 tasks studied.

Improving Spiking Dynamical Networks: Accurate Delays, Higher-Order Synapses, and Time Cells

The theory behind the neural engineering framework is extended to permit the use of a broad class of synapse models while maintaining prescribed dynamics up to a given order, which improves the understanding of how low-level synaptic properties alter the accuracy of high-level computations in spiking dynamical networks.

Implementing FFTs in Practice

A discussion of the considerations involved in high-performance FFT implementations, which center largely on memory access and other non-arithmetic concerns, illustrated by a case study of FFTW.
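This reference presumably matters here because the unrolled (convolutional) form of the LMU memory can be evaluated with FFTs, turning an O(n^2) convolution into O(n log n). A small NumPy sketch, using stand-in signals rather than anything from the paper, checks that FFT-based convolution matches direct convolution:

```python
import numpy as np

def fft_convolve(h, u):
    """Linear convolution of h and u via the FFT: zero-pad both to the full
    output length so the circular convolution equals the linear one."""
    n = len(h) + len(u) - 1
    return np.fft.irfft(np.fft.rfft(h, n) * np.fft.rfft(u, n), n)[: len(u)]

rng = np.random.default_rng(0)
h = rng.normal(size=256)       # e.g. one channel of an impulse response
u = rng.normal(size=256)       # input sequence
direct = np.convolve(h, u)[: len(u)]
assert np.allclose(fft_convolve(h, u), direct)
```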