# Long Short-Term Memory

```bibtex
@article{Hochreiter1997LongSM,
  title   = {Long Short-Term Memory},
  author  = {Sepp Hochreiter and J{\"u}rgen Schmidhuber},
  journal = {Neural Computation},
  year    = {1997},
  volume  = {9},
  pages   = {1735--1780}
}
```

Learning to store information over extended time intervals by recurrent backpropagation takes a very long time, mostly because of insufficient, decaying error backflow. [...] Truncating the gradient where this does not do harm, LSTM can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units. Multiplicative gate units learn to open and close access to the constant error flow. LSTM is local in space…
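The mechanism described in the abstract can be sketched in a few lines of NumPy. This is an illustrative sketch of the now-standard LSTM formulation (which includes a forget gate added after the 1997 paper), not the exact 1997 equations: the cell state plays the role of the constant error carousel, and the sigmoid gates learn to open and close access to it.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, b):
    """One step of a standard LSTM cell (post-1997 variant with forget gate).

    The cell state c is the "constant error carousel": it is updated only
    additively through elementwise gating, so error can flow through it
    largely undecayed across many time steps.
    """
    z = W @ np.concatenate([x, h]) + b  # all four gate pre-activations at once
    H = h.size
    i = sigmoid(z[0 * H:1 * H])         # input gate: controls write access
    f = sigmoid(z[1 * H:2 * H])         # forget gate: controls carousel decay
    o = sigmoid(z[2 * H:3 * H])         # output gate: controls read access
    g = np.tanh(z[3 * H:4 * H])         # candidate cell input
    c_new = f * c + i * g               # additive carousel update
    h_new = o * np.tanh(c_new)          # gated exposure of the cell state
    return h_new, c_new

# tiny usage example with random weights (illustrative only)
rng = np.random.default_rng(0)
X_DIM, H_DIM = 3, 4
W = rng.normal(scale=0.1, size=(4 * H_DIM, X_DIM + H_DIM))
b = np.zeros(4 * H_DIM)
h, c = np.zeros(H_DIM), np.zeros(H_DIM)
for t in range(5):
    h, c = lstm_step(rng.normal(size=X_DIM), h, c, W, b)
```

With the forget gate near 1 and the input gate near 0, `c` is carried through time essentially unchanged, which is what lets gradients bridge long time lags.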

## 47,313 Citations

On the importance of sluggish state memory for learning long term dependency

- Computer Science · Knowl. Based Syst.
- 2016

It is demonstrated that an MRN, optimised with noise injection, is able to learn the long term dependency within a complex grammar induction task, significantly outperforming the SRN, NARX and ESN.

Language Modeling through Long-Term Memory Network

- Computer Science, Mathematics · 2019 International Joint Conference on Neural Networks (IJCNN)
- 2019

This paper introduces Long Term Memory network (LTM), which can tackle the exploding and vanishing gradient problems and handles long sequences without forgetting.

Learning long-term dependencies in segmented-memory recurrent neural networks with backpropagation of error

- Computer Science · Neurocomputing
- 2014

A comparison on the information latching problem showed that eRTRL is better able to handle the latching of information over longer periods of time, even though eBPTT guaranteed better generalisation when training was successful; pre-training significantly improved the ability of eBPTT to learn long-term dependencies.

Learning Sparse Hidden States in Long Short-Term Memory

- Computer Science · ICANN
- 2019

This work proposes to explicitly impose sparsity on the hidden states to adapt them to the required information and shows that sparsity reduces the computational complexity and improves the performance of LSTM networks.
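The citation above does not spell out its sparsification mechanism; one common way to impose sparsity on hidden states, shown here purely as an illustrative sketch (not the cited paper's method), is a soft threshold (the L1 proximal operator) applied elementwise, which zeroes small activations exactly:

```python
import numpy as np

def soft_threshold(h, lam):
    """L1 proximal operator: shrinks activations toward zero by lam,
    setting entries with magnitude below lam exactly to zero."""
    return np.sign(h) * np.maximum(np.abs(h) - lam, 0.0)

# hypothetical hidden-state vector; small entries get zeroed out
h = np.array([0.8, -0.05, 0.02, -0.6, 0.1])
h_sparse = soft_threshold(h, lam=0.1)
```

Applied after each recurrent step, such an operator keeps only the hidden units carrying enough signal, reducing the effective computation per step.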

Legendre Memory Units: Continuous-Time Representation in Recurrent Neural Networks

- Computer Science · NeurIPS
- 2019

Backpropagation through the ODE solver allows each layer to adapt its internal time-step, enabling the network to learn task-relevant time-scales and exceed state-of-the-art performance among RNNs on permuted sequential MNIST.

Learning long-term dependencies with recurrent neural networks

- Computer Science · Neurocomputing
- 2008

It is shown that basic time-delay RNNs, unfolded in time and formulated as state-space models, are indeed capable of learning time lags of at least 100 time steps and even possess a self-regularisation characteristic that adapts the internal error backflow; their optimal weight initialisation is also analysed.

Learning Longer Memory in Recurrent Neural Networks

- Computer Science · ICLR
- 2015

This paper shows that learning longer term patterns in real data, such as in natural language, is perfectly possible using gradient descent, by using a slight structural modification of the simple recurrent neural network architecture.

On Extended Long Short-term Memory and Dependent Bidirectional Recurrent Neural Network

- Computer Science, Mathematics · Neurocomputing
- 2019

This work first analyzes the memory behavior of three recurrent neural network cells, then introduces trainable scaling factors that act like an attention mechanism to adjust memory decay adaptively, and proposes a dependent bidirectional recurrent neural network (DBRNN).

Low-pass Recurrent Neural Networks - A memory architecture for longer-term correlation discovery

- Computer Science, Mathematics · arXiv
- 2018

A simple, effective memory strategy is proposed that can extend the window over which BPTT can learn without requiring longer traces; the strategy is explored empirically on a few tasks and its implications are discussed.

Internal Memory Gate for Recurrent Neural Networks with Application to Spoken Language Understanding

- Computer Science · INTERSPEECH
- 2017

The effectiveness and robustness of the proposed IMG-RNN are evaluated on a classification task over a small corpus of spoken dialogues from the DECODA project, which allows the capability of each RNN to encode short-term dependencies to be assessed.

## References

Showing 1–10 of 69 references

Learning long-term dependencies in NARX recurrent neural networks

- Computer Science, Medicine · IEEE Trans. Neural Networks
- 1996

It is shown that the long-term dependencies problem is lessened for a class of architectures called nonlinear autoregressive models with exogenous (NARX) recurrent neural networks, which have powerful representational capabilities.

Learning Unambiguous Reduced Sequence Descriptions

- Computer Science · NIPS
- 1991

Experiments show that systems based on these principles can require less computation per time step and many fewer training sequences than conventional training algorithms for recurrent nets.

Bridging Long Time Lags by Weight Guessing and "Long Short-Term Memory"

- 1996

Numerous recent papers (including many NIPS papers) focus on standard recurrent nets' inability to deal with long time lags between relevant input signals and teacher signals. Rather sophisticated,…

Induction of Multiscale Temporal Structure

- Computer Science · NIPS
- 1991

Simulation experiments indicate that hidden units operating with slower time constants are able to pick up global structure that simply cannot be learned by standard backpropagation.

Learning Sequential Structure with the Real-Time Recurrent Learning Algorithm

- Computer Science · Int. J. Neural Syst.
- 1989

A more powerful recurrent learning procedure, real-time recurrent learning (RTRL), is applied to some of the same problems studied by Servan-Schreiber, Cleeremans, and McClelland; analysis of the internal representations developed by RTRL networks revealed that they learn a rich set of internal states representing more about the past than the underlying grammar requires.

A Local Learning Algorithm for Dynamic Feedforward and Recurrent Networks

- Computer Science
- 1989

This paper proposes a parallel on-line learning algorithm which performs local computations only, yet is still designed to deal with hidden units and with units whose past activations are 'hidden in time'.

Continuous history compression

- Computer Science
- 1993

A continuous version of history compression is described in which elements are discarded in a graded fashion depending on their predictability, embodied by their (Shannon) information.

Learning long-term dependencies with gradient descent is difficult

- Computer Science, Medicine · IEEE Trans. Neural Networks
- 1994

This work shows why gradient based learning algorithms face an increasingly difficult problem as the duration of the dependencies to be captured increases, and exposes a trade-off between efficient learning by gradient descent and latching on information for long periods.
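The trade-off described above can be illustrated with a one-dimensional toy recurrence (hypothetical parameter values, not from the cited paper): the gradient through T steps is a product of T Jacobian factors, each bounded here by |w|, so it shrinks geometrically as the dependency gets longer.

```python
import numpy as np

def gradient_through_time(w, h0, T):
    """Gradient dh_T/dh_0 for the scalar recurrence h_t = tanh(w * h_{t-1}).

    By the chain rule, this is a product of T one-step Jacobians
    w * tanh'(w * h_{t-1}); with |w * tanh'| < 1 the product decays
    geometrically -- the vanishing-gradient side of the trade-off.
    """
    h, grad = h0, 1.0
    for _ in range(T):
        pre = w * h
        h = np.tanh(pre)
        grad *= w * (1.0 - np.tanh(pre) ** 2)  # one Jacobian factor per step
    return grad

g10 = abs(gradient_through_time(w=0.9, h0=0.5, T=10))
g100 = abs(gradient_through_time(w=0.9, h0=0.5, T=100))
```

A weight small enough to keep the dynamics stable (latching information) is precisely what makes the gradient product vanish, which is the trade-off the paper exposes.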

Learning Complex, Extended Sequences Using the Principle of History Compression

- Computer Science · Neural Computation
- 1992

A simple principle for reducing the descriptions of event sequences without loss of information is introduced and this insight leads to the construction of neural architectures that learn to divide and conquer by recursively decomposing sequences.

Generalization of backpropagation with application to a recurrent gas market model

- Computer Science · Neural Networks
- 1988