# Learning to Forget: Continual Prediction with LSTM

@article{Gers2000LearningTF, title={Learning to Forget: Continual Prediction with LSTM}, author={Felix Alexander Gers and J{\"u}rgen Schmidhuber and Fred Cummins}, journal={Neural Computation}, year={2000}, volume={12}, pages={2451-2471} }

Long short-term memory (LSTM; Hochreiter & Schmidhuber, 1997) can solve numerous tasks not solvable by previous learning algorithms for recurrent neural networks (RNNs). We identify a weakness of LSTM networks processing continual input streams that are not a priori segmented into subsequences with explicitly marked ends at which the network's internal state could be reset. Without resets, the state may grow indefinitely and eventually cause the network to break down. Our remedy is a novel…
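The state-growth problem described in the abstract can be illustrated with a minimal scalar sketch. It assumes the standard cell update `c_t = f_t * c_{t-1} + i_t * g_t`, where `f_t = 1` corresponds to a cell without a forget gate. The function name `run_cell`, the unit weights, and the fixed forget pre-activation are hypothetical choices for illustration only; in the paper the forget gate is learned rather than fixed.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def run_cell(inputs, forget_preact=None):
    """Track a scalar cell state over an unsegmented input stream.

    forget_preact=None means no forget gate (f_t = 1, state is never decayed).
    Otherwise the forget activation is sigmoid(forget_preact), held constant
    here as a simplifying assumption.
    """
    c = 0.0
    trace = []
    for x in inputs:
        i_gate = sigmoid(x)   # input gate (hypothetical weight 1, no bias)
        g = math.tanh(x)      # squashed cell input
        f_gate = 1.0 if forget_preact is None else sigmoid(forget_preact)
        c = f_gate * c + i_gate * g
        trace.append(c)
    return trace


# A continual stream with no marked sequence ends and no external resets.
stream = [1.0] * 1000
no_forget = run_cell(stream)                        # state keeps accumulating
with_forget = run_cell(stream, forget_preact=2.0)   # state settles to a bound

print(f"final state without forget gate: {no_forget[-1]:.2f}")
print(f"final state with forget gate:    {with_forget[-1]:.2f}")
```

With `f_t = 1` the state grows linearly with the length of the stream, whereas any forget activation below 1 makes it converge to a finite value, which is the sense in which the gate lets a cell release internal resources.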

## 3,479 Citations

Learning to Forget: Continual Prediction with LSTM

- Computer Science
- 1999

This work identifies a weakness of LSTM networks processing continual input streams without explicitly marked sequence ends and proposes an adaptive "forget gate" that enables an LSTM cell to learn to reset itself at appropriate times, thus releasing internal resources.

Learning Precise Timing with LSTM Recurrent Networks

- Computer Science
- J. Mach. Learn. Res.
- 2002

This work finds that LSTM augmented by "peephole connections" from its internal cells to its multiplicative gates can learn the fine distinction between sequences of spikes spaced either 50 or 49 time steps apart without the help of any short training exemplars.

An Empirical Exploration of Recurrent Network Architectures

- Computer Science
- ICML
- 2015

It is found that adding a bias of 1 to the LSTM's forget gate closes the gap between the LSTM and the recently introduced Gated Recurrent Unit (GRU) on some but not all tasks.

Revisiting NARX Recurrent Neural Networks for Long-Term Dependencies

- Computer Science
- ArXiv
- 2017

It is shown that for $\tau \le 2^{n_d - 1}$, MIST RNNs reduce the decay's worst-case exponent from $\tau / n_d$ to $\log \tau$, while maintaining computational complexity that is similar to LSTM and GRUs.

EXTRAPOLATED INPUT NETWORK SIMPLIFICATION

- Computer Science
- 2019

This paper contrasts the two canonical recurrent neural networks, long short-term memory (LSTM) and the gated recurrent unit (GRU), and proposes a novel lightweight RNN, extrapolated input for network simplification (EINS), whose design abandons the LSTM redundancies.

Recurrent Neural Networks for Learning Long-term Temporal Dependencies with Reanalysis of Time Scale Representation

- Computer Science
- 2021 IEEE International Conference on Big Knowledge (ICBK)
- 2021

This work argues that the interpretation of a forget gate as a temporal representation is valid when the gradient of the loss with respect to the state decreases exponentially going back in time, and empirically demonstrates that existing RNNs satisfy this gradient condition in the initial training phase on several tasks, in good agreement with previous initialization methods.

A review on the long short-term memory model

- Computer Science
- Artificial Intelligence Review
- 2020

A comprehensive review of LSTM’s formulation and training, relevant applications reported in the literature and code resources implementing this model for a toy example are presented.

Learning compact recurrent neural networks

- Computer Science
- 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2016

This work studies mechanisms for learning compact RNNs and LSTMs via low-rank factorizations and parameter sharing schemes, and finds a hybrid strategy of using structured matrices in the bottom layers and shared low-rank factors on the top layers to be particularly effective.

Gated Orthogonal Recurrent Units: On Learning to Forget

- Computer Science
- Neural Computation
- 2019

We present a novel recurrent neural network (RNN)–based model that combines the remembering ability of unitary evolution RNNs with the ability of gated RNNs to effectively forget redundant or…

Recurrent nets that time and count

- Computer Science
- Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks. IJCNN 2000. Neural Computing: New Challenges and Perspectives for the New Millennium
- 2000

Surprisingly, LSTM augmented by "peephole connections" from its internal cells to its multiplicative gates can learn the fine distinction between sequences of spikes separated by either 50 or 49 discrete time steps, without the help of any short training exemplars.

## References

SHOWING 1-10 OF 62 REFERENCES

Learning to Forget: Continual Prediction with LSTM

- Computer Science
- 1999

This work identifies a weakness of LSTM networks processing continual input streams without explicitly marked sequence ends and proposes an adaptive "forget gate" that enables an LSTM cell to learn to reset itself at appropriate times, thus releasing internal resources.

Long Short-Term Memory

- Computer Science
- Neural Computation
- 1997

A novel, efficient, gradient based method called long short-term memory (LSTM) is introduced, which can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units.

LSTM recurrent networks learn simple context-free and context-sensitive languages

- Computer Science
- IEEE Trans. Neural Networks
- 2001

Long short-term memory (LSTM) variants are also the first RNNs to learn a simple context-sensitive language, namely $a^n b^n c^n$.

Learning long-term dependencies with gradient descent is difficult

- Computer Science
- IEEE Trans. Neural Networks
- 1994

This work shows why gradient based learning algorithms face an increasingly difficult problem as the duration of the dependencies to be captured increases, and exposes a trade-off between efficient learning by gradient descent and latching on information for long periods.

Gradient calculations for dynamic recurrent neural networks: a survey

- Computer Science
- IEEE Trans. Neural Networks
- 1995

The author discusses advantages and disadvantages of temporally continuous neural networks in contrast to clocked ones and presents some "tricks of the trade" for training, using, and simulating continuous time and recurrent neural networks.

Learning Sequential Structure with the Real-Time Recurrent Learning Algorithm

- Computer Science
- Int. J. Neural Syst.
- 1989

A more powerful recurrent learning procedure, called real-time recurrent learning (RTRL), is applied to some of the same problems studied by Servan-Schreiber, Cleeremans, and McClelland. The internal representations developed by RTRL networks reveal that they learn a rich set of internal states that represent more about the past than is required by the underlying grammar.

The Recurrent Cascade-Correlation Architecture

- Computer Science
- NIPS
- 1990

Recurrent Cascade-Correlation (RCC) is a recurrent version of the Cascade-Correlation learning architecture of Fahlman and Lebiere [Fahlman, 1990]. RCC can learn from examples to map a sequence of…

Finite State Automata and Simple Recurrent Networks

- Computer Science
- Neural Computation
- 1989

This work studies a network architecture introduced by Elman (1988) for predicting successive elements of a sequence and shows that long-distance sequential contingencies can be encoded by the network even if only subtle statistical properties of embedded strings depend on the early information.

A Local Learning Algorithm for Dynamic Feedforward and Recurrent Networks

- Computer Science
- 1989

This paper proposes a parallel on-line learning algorithm that performs local computations only, yet is still designed to deal with hidden units and with units whose past activations are ‘hidden in time’.

Generalization of backpropagation with application to a recurrent gas market model

- Mathematics
- Neural Networks
- 1988