Long Short-Term Memory

@article{Hochreiter1997LongSM,
  title={Long Short-Term Memory},
  author={Sepp Hochreiter and J{\"u}rgen Schmidhuber},
  journal={Neural Computation},
  year={1997},
  volume={9},
  pages={1735--1780}
}
Learning to store information over extended time intervals by recurrent backpropagation takes a very long time, mostly because of insufficient, decaying error backflow. […] Truncating the gradient where this does not do harm, LSTM can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units. Multiplicative gate units learn to open and close access to the constant error flow. LSTM is local in space…
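The constant error carousel and the multiplicative gates can be summarised in a few lines. Below is a minimal NumPy sketch of one step of a 1997-style memory cell, assuming the original formulation with input and output gates only (no forget gate); the tanh squashing and weight shapes are illustrative choices, not the paper's exact g/h nonlinearities.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell_step(x, y_prev, s_prev, W_in, W_out, W_cell):
    # One step of a 1997-style LSTM memory cell (input and output gates only).
    z = np.concatenate([x, y_prev])      # gates and cell input see the input plus previous cell output
    i = sigmoid(W_in @ z)                # input gate: learns when to write
    o = sigmoid(W_out @ z)               # output gate: learns when to read
    g = np.tanh(W_cell @ z)              # squashed candidate cell input
    # Constant error carousel: the state's self-connection is fixed at 1,
    # so error propagated back through s is neither amplified nor decayed.
    s = s_prev + i * g
    y = o * np.tanh(s)                   # gated, squashed cell output
    return y, s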

Learning Long-Term Dependencies in Irregularly-Sampled Time Series

This work designs a new algorithm based on the long short-term memory (LSTM) that separates its memory from its time-continuous state within the RNN, allowing it to respond to inputs arriving at arbitrary time-lags while ensuring a constant error propagation through the memory path.
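A rough sketch of the separation described above, assuming the memory path is a standard gated cell state and the time-continuous state is advanced by a few fixed Euler steps over the elapsed gap dt; the function name, parameter layout, and the exact coupling between the two paths are illustrative assumptions, not the paper's equations.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def irregular_lstm_step(x, dt, h, c, params, n_euler=4):
    # Gated memory path `c` plus a state `h` integrated over the irregular gap `dt`.
    Wf, Wi, Wo, Wg, Wode = params
    z = np.concatenate([x, h])
    f, i, o = sigmoid(Wf @ z), sigmoid(Wi @ z), sigmoid(Wo @ z)
    c = f * c + i * np.tanh(Wg @ z)      # memory path: error flows through c unscaled when f is near 1
    h = o * np.tanh(c)                   # gated read of the memory feeds the continuous-time state
    step = dt / n_euler                  # integrate over the actual elapsed time, not a fixed tick
    for _ in range(n_euler):
        h = h + step * np.tanh(Wode @ np.concatenate([x, h]))
    return h, c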

On the importance of sluggish state memory for learning long term dependency

Language Modeling through Long-Term Memory Network

This paper introduces the Long Term Memory network (LTM), which tackles the exploding and vanishing gradient problems and handles long sequences without forgetting.

Learning Sparse Hidden States in Long Short-Term Memory

This work proposes to explicitly impose sparsity on the hidden states to adapt them to the required information and shows that sparsity reduces the computational complexity and improves the performance of LSTM networks.
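One common way to impose the sparsity the summary describes is a hard top-k mask on the hidden state, sketched below; whether the paper uses top-k, an L1 penalty, or another scheme is not stated here, so treat this as an illustrative assumption.

import numpy as np

def sparsify_hidden(h, k):
    # Keep only the k largest-magnitude entries of the hidden state; zero the rest.
    if k >= h.size:
        return h
    keep = np.argpartition(np.abs(h), -k)[-k:]
    mask = np.zeros_like(h)
    mask[keep] = 1.0
    return h * mask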

Legendre Memory Units: Continuous-Time Representation in Recurrent Neural Networks

Backpropagation through the ODE solver allows each layer to adapt its internal time-step, enabling the network to learn task-relevant time-scales and exceed state-of-the-art performance among RNNs on permuted sequential MNIST.
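The claim about backpropagating through the solver can be made concrete with a toy sketch: the discretization step is itself a trainable parameter, so the task loss adjusts each layer's time-scale. The matrices below are random placeholders, not the Legendre state-space matrices of the actual LMU.

import torch
import torch.nn as nn

class EulerMemory(nn.Module):
    # Linear memory advanced by one explicit Euler step per input, with a learnable time-step.
    def __init__(self, d):
        super().__init__()
        self.A = nn.Parameter(0.1 * torch.randn(d, d))
        self.B = nn.Parameter(0.1 * torch.randn(d))
        self.log_dt = nn.Parameter(torch.zeros(1))        # the internal time-step is trainable

    def forward(self, u_seq):                             # u_seq: 1-D tensor of scalar inputs
        dt = torch.exp(self.log_dt)                       # keep the step positive
        m = torch.zeros(self.A.shape[0])
        for u in u_seq:
            m = m + dt * (self.A @ m + self.B * u)        # Euler step; gradients also reach log_dt
        return m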

Learning Longer Memory in Recurrent Neural Networks

This paper shows that learning longer-term patterns in real data, such as natural language, is perfectly possible using gradient descent, achieved by a slight structural modification of the simple recurrent neural network architecture.
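One way to read "slight structural modification" is a set of context units constrained to change slowly, sketched below; the exact wiring and the value of alpha are illustrative assumptions rather than the paper's verbatim equations.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def slow_context_step(x, h, s, Wx, Wh, Ws, B, alpha=0.95):
    # `s` is a slowly-changing context layer: an exponential average of the projected input.
    s = alpha * s + (1.0 - alpha) * (B @ x)
    # the ordinary fast hidden units also see the slow context
    h = sigmoid(Wx @ x + Wh @ h + Ws @ s)
    return h, s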

On Extended Long Short-term Memory and Dependent Bidirectional Recurrent Neural Network

Low-pass Recurrent Neural Networks - A memory architecture for longer-term correlation discovery

A simple, effective memory strategy is proposed that can extend the window over which BPTT can learn without requiring longer traces; the strategy is explored empirically on a few tasks and its implications are discussed.
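A sketch of the general low-pass idea the title points to: an exponentially filtered summary of the hidden activations carried across truncated-BPTT segments. The time constant and where the filtered signal is fed back are assumptions, not the paper's architecture.

import numpy as np

def lowpass_memory(memory, h, tau=100.0):
    # Exponential low-pass filter of the hidden activations.  Carrying `memory`
    # across truncated-BPTT segments gives the network a view of evidence older
    # than the truncation window, without storing longer traces.
    beta = 1.0 / tau
    return (1.0 - beta) * memory + beta * h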
...

References

Showing 1-10 of 49 references

Learning long-term dependencies in NARX recurrent neural networks

It is shown that the long-term dependencies problem is lessened for a class of architectures called nonlinear autoregressive models with exogenous inputs (NARX recurrent neural networks), which have powerful representational capabilities.
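The NARX form itself is simple to state: the output depends on several delayed inputs and delayed outputs, so error can reach events many steps back through a single tap rather than through many recurrent hops. The single-layer readout below is an illustrative stand-in for whatever network the paper actually uses.

import numpy as np

def narx_step(x_taps, y_taps, W, b):
    # NARX output: y_t = f(x_t, ..., x_{t-dx}, y_{t-1}, ..., y_{t-dy}).
    # The delayed output taps act as jump-ahead connections, so error reaches an
    # event dy steps in the past in one hop instead of dy recurrent hops.
    z = np.concatenate(list(x_taps) + list(y_taps))
    return np.tanh(W @ z + b)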

Learning Unambiguous Reduced Sequence Descriptions

Experiments show that systems based on these principles can require less computation per time step and many fewer training sequences than conventional training algorithms for recurrent nets.

Bridging Long Time Lags by Weight Guessing and "Long Short Term Memory"

Long short-term memory (LSTM), the authors' own recent algorithm, is used to solve hard problems that can neither be quickly solved by random weight guessing nor by any other recurrent net algorithm the authors are aware of.
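The weight-guessing baseline is literally random search over parameters, as in the sketch below; `solves_task` is a hypothetical caller-supplied check used only for this illustration.

import numpy as np

def guess_weights(solves_task, shape, max_trials=100_000, scale=1.0, seed=0):
    # Random weight guessing baseline: draw every weight at random and keep the
    # first draw that solves the task.  `solves_task` is a hypothetical interface
    # used only for this sketch.
    rng = np.random.default_rng(seed)
    for trial in range(1, max_trials + 1):
        W = rng.uniform(-scale, scale, size=shape)
        if solves_task(W):
            return W, trial
    return None, max_trials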

Induction of Multiscale Temporal Structure

Simulation experiments indicate that, by using hidden units that operate with different time constants, slower time-scale hidden units are able to pick up global structure, structure that simply cannot be learned by standard backpropagation.
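A leaky-integrator sketch of units with different time constants; the specific update rule below is an illustrative assumption rather than the paper's exact formulation.

import numpy as np

def multiscale_step(x, h, W_in, W_rec, tau):
    # `tau` holds one time constant per hidden unit (values >= 1).  Units with a
    # large tau drift slowly toward their target activation, so they integrate
    # global, long-range structure; units with tau near 1 track local detail.
    target = np.tanh(W_in @ x + W_rec @ h)
    return h + (target - h) / tau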

Learning Sequential Structure with the Real-Time Recurrent Learning Algorithm

A more powerful recurrent learning procedure, called real-time recurrent learning (RTRL), is applied to some of the same problems studied by Servan-Schreiber, Cleeremans, and McClelland; analysis of the internal representations developed by RTRL networks reveals that they learn a rich set of internal states that represent more about the past than is required by the underlying grammar.
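A minimal NumPy sketch of RTRL for a vanilla tanh RNN. The squared error is applied directly to the hidden state only to keep the example short; the O(n^2 * P) per-step cost of carrying the sensitivity matrix is the procedure's well-known drawback.

import numpy as np

def rtrl_gradient(xs, ys, W_h, W_x):
    # Real-Time Recurrent Learning for a vanilla tanh RNN, h_t = tanh(W_h h_{t-1} + W_x x_t).
    # The full sensitivity matrix S = dh/dtheta is carried forward in time, so each
    # step's loss gradient is available online without unrolling the past.
    n, m = W_x.shape
    P = n * n + n * m                                    # flattened parameter count
    h = np.zeros(n)
    S = np.zeros((n, P))                                 # sensitivities dh/dtheta
    grad = np.zeros(P)
    for x, y in zip(xs, ys):
        a = W_h @ h + W_x @ x
        h_new = np.tanh(a)
        # explicit derivative of the pre-activation w.r.t. the (flattened) weights
        K = np.hstack([np.kron(np.eye(n), h[None, :]),   # w.r.t. W_h (uses previous h)
                       np.kron(np.eye(n), x[None, :])])  # w.r.t. W_x
        S = (1.0 - h_new ** 2)[:, None] * (W_h @ S + K)  # RTRL sensitivity update
        h = h_new
        grad += (h - y) @ S                              # dL_t/dtheta for L_t = 0.5*||h_t - y_t||^2
    return grad                                          # split as grad[:n*n] -> W_h, grad[n*n:] -> W_x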

A Local Learning Algorithm for Dynamic Feedforward and Recurrent Networks

This paper proposes a parallel on-line learning algorithm which performs local computations only, yet is still designed to deal with hidden units and with units whose past activations are ‘hidden in time’.

Continuous history compression

A continuous version of history compression is described in which elements are discarded in a graded fashion dependent on their predictability, embodied by their (Shannon) information.
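The graded discarding can be pictured as a weighting of each element by its (Shannon) information under the predictor; the normalisation to [0, 1] below is an illustrative choice, not the paper's exact rule.

import numpy as np

def graded_retention(probs):
    # `probs` holds the predictor's probability for each element that actually occurred.
    # The retention weight grows with the element's Shannon information -log2(p):
    # fully predictable elements get weight near 0, surprising ones near 1.
    info = -np.log2(np.clip(probs, 1e-12, 1.0))
    return info / (info.max() + 1e-12)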

Learning long-term dependencies with gradient descent is difficult

This work shows why gradient-based learning algorithms face an increasingly difficult problem as the duration of the dependencies to be captured increases, and exposes a trade-off between efficient learning by gradient descent and latching on to information for long periods.
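The difficulty can be seen numerically in a few lines: multiply the per-step Jacobians of a toy tanh recurrence and watch their norm decay (or blow up) with the number of steps. The recurrence and weight scale here are illustrative, not the paper's analysis setup.

import numpy as np

def backflow_norms(W, T=50, seed=0):
    # Track the norm of the Jacobian that multiplies an error signal carried back
    # through t steps of h_t = tanh(W h_{t-1}).  With typical weights this norm
    # shrinks (or explodes) roughly exponentially in t, which is the difficulty
    # the paper analyses.
    rng = np.random.default_rng(seed)
    h = rng.standard_normal(W.shape[0])
    J = np.eye(W.shape[0])
    norms = []
    for _ in range(T):
        h = np.tanh(W @ h)
        J = (np.diag(1.0 - h ** 2) @ W) @ J
        norms.append(np.linalg.norm(J, 2))
    return norms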

Generalization of backpropagation with application to a recurrent gas market model

Gradient calculations for dynamic recurrent neural networks: a survey

The author discusses advantages and disadvantages of temporally continuous neural networks in contrast to clocked ones and presents some "tricks of the trade" for training, using, and simulating continuous time and recurrent neural networks.