Learning to Forget: Continual Prediction with LSTM

@article{Gers2000LearningTF,
  title={Learning to Forget: Continual Prediction with LSTM},
  author={Felix Alexander Gers and J{\"u}rgen Schmidhuber and Fred Cummins},
  journal={Neural Computation},
  year={2000},
  volume={12},
  pages={2451-2471}
}
Long short-term memory (LSTM; Hochreiter & Schmidhuber, 1997) can solve numerous tasks not solvable by previous learning algorithms for recurrent neural networks (RNNs). We identify a weakness of LSTM networks processing continual input streams that are not a priori segmented into subsequences with explicitly marked ends at which the network's internal state could be reset. Without resets, the state may grow indefinitely and eventually cause the network to break down. Our remedy is a novel… 
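As context for the remedy the abstract alludes to, here is a minimal numpy sketch of a single LSTM step extended with a forget gate; the parameter names (W_f, b_f, and so on) are illustrative and not taken from the paper.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, params):
    # One step of an LSTM cell with a forget gate.
    # x: input vector, h_prev: previous output, c_prev: previous cell state.
    # params: weight matrices W_* acting on [h_prev; x] and bias vectors b_*.
    z = np.concatenate([h_prev, x])
    f = sigmoid(params["W_f"] @ z + params["b_f"])  # forget gate: fraction of old state kept
    i = sigmoid(params["W_i"] @ z + params["b_i"])  # input gate
    g = np.tanh(params["W_g"] @ z + params["b_g"])  # candidate cell update
    o = sigmoid(params["W_o"] @ z + params["b_o"])  # output gate
    c = f * c_prev + i * g                          # without f, c_prev could only accumulate
    h = o * np.tanh(c)
    return h, c

When the forget gate learns to output values near zero at subsequence boundaries, the cell state is reset without any external segmentation signal, which is the behavior the abstract describes.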
Learning to Forget: Continual Prediction with LSTM
TLDR
This work identifies a weakness of LSTM networks processing continual input streams without explicitly marked sequence ends and proposes an adaptive "forget gate" that enables an LSTM cell to learn to reset itself at appropriate times, thus releasing internal resources.
Learning Precise Timing with LSTM Recurrent Networks
TLDR
This work finds that LSTM augmented by "peephole connections" from its internal cells to its multiplicative gates can learn the fine distinction between sequences of spikes spaced either 50 or 49 time steps apart without the help of any short training exemplars.
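The peephole idea amounts to letting each gate also read the cell state directly. Below is a hedged sketch in the same style as the step function after the abstract (the peephole weight vectors w_pf, w_pi, w_po are illustrative names; the peephole terms are elementwise, one weight per cell).

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_peephole_step(x, h_prev, c_prev, params):
    # LSTM step with peephole connections: the gates also see the cell state,
    # which is what allows the network to resolve precise timing differences.
    z = np.concatenate([h_prev, x])
    # forget and input gates peek at the previous cell state
    f = sigmoid(params["W_f"] @ z + params["w_pf"] * c_prev + params["b_f"])
    i = sigmoid(params["W_i"] @ z + params["w_pi"] * c_prev + params["b_i"])
    g = np.tanh(params["W_g"] @ z + params["b_g"])
    c = f * c_prev + i * g
    # the output gate peeks at the freshly updated cell state
    o = sigmoid(params["W_o"] @ z + params["w_po"] * c + params["b_o"])
    h = o * np.tanh(c)
    return h, c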
An Empirical Exploration of Recurrent Network Architectures
TLDR
It is found that adding a bias of 1 to the LSTM's forget gate closes the gap between the LSTM and the recently introduced Gated Recurrent Unit (GRU) on some but not all tasks.
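That recommendation is a one-line change at initialization time. Here is a minimal sketch consistent with the parameter layout used in the earlier snippets (all names illustrative):

import numpy as np

def init_lstm_params(n_in, n_hidden, forget_bias=1.0, seed=0):
    # Illustrative LSTM parameter initialization.
    # forget_bias=1.0 starts the forget gate mostly open (sigmoid(1) ~ 0.73),
    # so early in training the cell state, and hence the gradient, is retained.
    rng = np.random.default_rng(seed)
    n_cols = n_hidden + n_in
    params = {}
    for gate in ("f", "i", "g", "o"):
        params["W_" + gate] = rng.normal(0.0, 0.1, size=(n_hidden, n_cols))
        params["b_" + gate] = np.zeros(n_hidden)
    params["b_f"] += forget_bias  # the bias-of-1 trick discussed in this entry
    return params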
Revisiting NARX Recurrent Neural Networks for Long-Term Dependencies
TLDR
It is shown that for τ ≤ 2^(n_d−1), MIST RNNs reduce the decay's worst-case exponent from τ/n_d to log τ, while maintaining computational complexity similar to that of LSTM and GRUs.
Extrapolated Input Network Simplification
  • Computer Science
  • 2019
TLDR
This paper contrasts the two canonical recurrent neural networks, long short-term memory (LSTM) and the gated recurrent unit (GRU), and proposes a novel lightweight RNN, Extrapolated Input for Network Simplification (EINS), whose design abandons the LSTM's redundancies.
Recurrent Neural Networks for Learning Long-term Temporal Dependencies with Reanalysis of Time Scale Representation
TLDR
It is argued that interpreting the forget gate as a temporal representation is valid when the gradient of the loss with respect to the state decreases exponentially as time goes back, and it is empirically demonstrated that existing RNNs satisfy this gradient condition in the initial training phase on several tasks, in good agreement with previous initialization methods.
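To make the "forget gate as a time scale" reading concrete: if the state update is roughly c_t ≈ f · c_{t−1}, an input's contribution decays as f^Δt, so a constant gate value f corresponds to a characteristic time of about −1/ln f ≈ 1/(1−f) steps when f is close to 1. A small numerical illustration (gate values chosen arbitrarily):

import numpy as np

for f in (0.5, 0.9, 0.99):
    # steps until a remembered value decays to 1/e of its size,
    # assuming c_t ≈ f * c_{t-1} with a constant forget gate value f
    timescale = -1.0 / np.log(f)
    print(f, round(timescale, 1), round(1.0 / (1.0 - f), 1))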
A review on the long short-term memory model
TLDR
A comprehensive review of LSTM’s formulation and training, relevant applications reported in the literature and code resources implementing this model for a toy example are presented.
Learning compact recurrent neural networks
TLDR
This work studies mechanisms for learning compact RNNs and LSTMs via low-rank factorizations and parameter sharing schemes, and finds a hybrid strategy of using structured matrices in the bottom layers and shared low-rank factors in the top layers to be particularly effective.
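As a concrete picture of the low-rank part of that strategy (sizes chosen arbitrarily for illustration): a dense recurrent matrix W is replaced by a product of two thin factors U and V, which shrinks both the parameter count and the cost of applying it.

import numpy as np

n_hidden, rank = 512, 32
rng = np.random.default_rng(0)

W_full = rng.normal(size=(n_hidden, n_hidden))   # dense recurrence: n^2 parameters
U = rng.normal(size=(n_hidden, rank))            # low-rank factors: 2*n*rank parameters
V = rng.normal(size=(rank, n_hidden))

h = rng.normal(size=n_hidden)
dense_term = W_full @ h        # O(n^2) multiply-adds
low_rank_term = U @ (V @ h)    # O(n*rank) multiply-adds

print(W_full.size, U.size + V.size)  # 262144 vs. 32768 parameters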
Gated Orthogonal Recurrent Units: On Learning to Forget
We present a novel recurrent neural network (RNN)-based model that combines the remembering ability of unitary evolution RNNs with the ability of gated RNNs to effectively forget redundant or irrelevant information in its memory.
Recurrent nets that time and count
  • Felix Alexander Gers, J. Schmidhuber
  • Computer Science
    Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks. IJCNN 2000. Neural Computing: New Challenges and Perspectives for the New Millennium
  • 2000
TLDR
Surprisingly, LSTM augmented by "peephole connections" from its internal cells to its multiplicative gates can learn the fine distinction between sequences of spikes separated by either 50 or 49 discrete time steps, without the help of any short training exemplars.
...

References

Showing 1-10 of 62 references
Learning to Forget: Continual Prediction with LSTM
TLDR
This work identifies a weakness of LSTM networks processing continual input streams without explicitly marked sequence ends and proposes an adaptive "forget gate" that enables an LSTM cell to learn to reset itself at appropriate times, thus releasing internal resources.
Long Short-Term Memory
TLDR
A novel, efficient, gradient based method called long short-term memory (LSTM) is introduced, which can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units.
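The "constant error carousel" refers to the cell's linear self-connection of weight 1: backpropagated error on the cell state is scaled by that unit weight rather than by a squashing-function derivative, so it does not shrink over long lags. A toy numerical comparison (constants chosen only for illustration):

import numpy as np

T = 1000                       # time lag the error signal must survive
error = 1.0

# ordinary recurrent path: repeated scaling by |weight * activation slope| < 1
w, slope = 0.9, 0.9
vanishing = error * (w * slope) ** T     # effectively zero (about 3e-92)

# constant error carousel: self-connection of exactly 1 leaves the error intact
carousel = error * 1.0 ** T              # still 1.0 after 1000 steps

print(vanishing, carousel)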
LSTM recurrent networks learn simple context-free and context-sensitive languages
TLDR
Long short-term memory (LSTM) variants are also the first RNNs to learn a simple context-sensitive language, namely a^n b^n c^n.
Learning long-term dependencies with gradient descent is difficult
TLDR
This work shows why gradient based learning algorithms face an increasingly difficult problem as the duration of the dependencies to be captured increases, and exposes a trade-off between efficient learning by gradient descent and latching on information for long periods.
Gradient calculations for dynamic recurrent neural networks: a survey
TLDR
The author discusses advantages and disadvantages of temporally continuous neural networks in contrast to clocked ones and presents some "tricks of the trade" for training, using, and simulating continuous time and recurrent neural networks.
Learning Sequential Structure with the Real-Time Recurrent Learning Algorithm
TLDR
A more powerful recurrent learning procedure, real-time recurrent learning (RTRL), is applied to some of the same problems studied by Servan-Schreiber, Cleeremans, and McClelland; the internal representations developed by the RTRL networks reveal that they learn a rich set of internal states representing more about the past than the underlying grammar requires.
The Recurrent Cascade-Correlation Architecture
Recurrent Cascade-Correlation (RCC) is a recurrent version of the Cascade-Correlation learning architecture of Fahlman and Lebiere [Fahlman, 1990]. RCC can learn from examples to map a sequence of inputs into a desired sequence of outputs.
Finite State Automata and Simple Recurrent Networks
TLDR
A network architecture introduced by Elman (1988) for predicting successive elements of a sequence and shows that long distance sequential contingencies can be encoded by the network even if only subtle statistical properties of embedded strings depend on the early information.
A Local Learning Algorithm for Dynamic Feedforward and Recurrent Networks
TLDR
This paper proposes a parallel on-line learning algorithm that performs local computations only, yet is still designed to deal with hidden units and with units whose past activations are 'hidden in time'.
...