LSTM recurrent networks learn simple context-free and context-sensitive languages

@article{Gers2001LSTMRN,
  title={LSTM recurrent networks learn simple context-free and context-sensitive languages},
  author={Felix A. Gers and J{\"u}rgen Schmidhuber},
  journal={IEEE Transactions on Neural Networks},
  year={2001},
  volume={12},
  number={6},
  pages={1333--1340}
}
Previous work on learning regular languages from exemplary training sequences showed that long short-term memory (LSTM) outperforms traditional recurrent neural networks (RNNs). We demonstrate LSTM's superior performance on context-free language benchmarks for RNNs, and show that it works even better than previous hardwired or highly specialized architectures. To the best of our knowledge, LSTM variants are also the first RNNs to learn a simple context-sensitive language, namely a^n b^n c^n.
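To make the benchmark concrete, here is a minimal data-generation sketch in Python for the a^n b^n c^n next-symbol-prediction task. The start marker 'S' and end marker 'T' are illustrative assumptions rather than necessarily the paper's exact encoding; the point is that the network sees one symbol at a time and must predict the next, which requires counting.

# Minimal sketch (illustrative encoding, not necessarily the paper's exact
# setup): next-symbol prediction data for the context-sensitive language
# a^n b^n c^n, delimited by an assumed start marker 'S' and end marker 'T'.

def make_sequence(n):
    """Return (inputs, targets) for one string of a^n b^n c^n."""
    string = ['S'] + ['a'] * n + ['b'] * n + ['c'] * n + ['T']
    inputs = string[:-1]   # the network reads the string one symbol at a time
    targets = string[1:]   # and must predict the symbol that follows
    return inputs, targets

def training_set(n_min=1, n_max=10):
    """Short exemplars for training; generalization is tested on larger n."""
    return [make_sequence(n) for n in range(n_min, n_max + 1)]

if __name__ == "__main__":
    xs, ys = make_sequence(2)
    # After an 'a' the next symbol is ambiguous ('a' or 'b'); once the first
    # 'b' appears, the rest of the string is fully determined by the counts.
    print(list(zip(xs, ys)))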
On learning context-free and context-sensitive languages
The long short-term memory (LSTM) is not the only neural network which learns a context-sensitive language. Second-order sequential cascaded networks (SCNs) are able to induce means from a finite …
Understanding LSTM - a tutorial into Long Short-Term Memory Recurrent Neural Networks
TLDR
This paper significantly improved documentation and fixed a number of errors and inconsistencies that accumulated in previous publications, focusing on the early, ground-breaking publications of LSTM-RNN.
Learning Context Sensitive Languages with LSTM Trained with Kalman Filters
TLDR
This novel combination of LSTM and the decoupled extended Kalman filter learns even faster and generalizes even better, requiring only the 10 shortest exemplars of the context-sensitive language a^n b^n c^n to deal correctly with values of n up to 1000 and more.
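For orientation, here is a hedged sketch of the decoupled extended Kalman filter (DEKF) weight update that this line of work combines with LSTM. The weight grouping, the scalar measurement noise R, the process noise Q, and the scalar-output simplification are assumptions; the per-group Jacobians H[i] would come from the network's own gradient machinery (e.g. LSTM's truncated gradient), which is left abstract here.

import numpy as np

# Hedged sketch of a decoupled extended Kalman filter (DEKF) weight update
# for a scalar output error.  Weight grouping and the noise terms R, Q are
# illustrative assumptions; H[i] is the Jacobian of the output with respect
# to the weights of group i, computed by the network's gradient machinery.

def dekf_step(groups, H, error, R=1.0, Q=1e-4):
    """groups: list of dicts with 'w' (weight vector) and 'P' (covariance).
    H: list of Jacobian vectors, one per group.  error: target - output."""
    # Global scaling factor shared by all groups (scalar-output case).
    a = 1.0 / (R + sum(h @ g['P'] @ h for g, h in zip(groups, H)))
    for g, h in zip(groups, H):
        k = a * (g['P'] @ h)                  # Kalman gain for this group
        g['w'] += k * error                   # weight update
        g['P'] -= np.outer(k, h @ g['P'])     # covariance update
        g['P'] += Q * np.eye(len(g['w']))     # process noise keeps P well-conditioned
    return groups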
On Evaluating the Generalization of LSTM Models in Formal Languages
TLDR
This paper empirically evaluates the inductive learning capabilities of Long Short-Term Memory networks, a popular extension of simple RNNs, to learn simple formal languages, in particular a^n b^n, a^n b^n c^n, and a^n b^n c^n d^n.
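The generalization criterion used in such evaluations can be summarized by a small harness (a sketch; model.predicts_correctly is a hypothetical interface standing in for whatever trained network and decoding rule are under test, and make_sequence is the generator sketched earlier): train on short strings, then find the largest n the network still handles.

def largest_generalized_n(model, make_sequence, n_start=1, n_stop=10000):
    """Return the largest n (up to n_stop) whose string the model predicts
    correctly; the model and its correctness check are hypothetical here."""
    for n in range(n_start, n_stop + 1):
        inputs, targets = make_sequence(n)
        if not model.predicts_correctly(inputs, targets):
            return n - 1   # last length handled correctly
    return n_stop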
Revisit Long Short-Term Memory: An Optimization Perspective
TLDR
This work proposes a matrix-based batch learning method for LSTM with full Backpropagation Through Time (BPTT), solves the state-drifting issues, and improves the overall performance of LSTM using revised activation functions for the gates.
Incremental training of first order recurrent neural networks to predict a context-sensitive language
Benchmarking of LSTM Networks
TLDR
Significant findings include: LSTM performance depends smoothly on learning rates, batching and momentum have no significant effect on performance, softmax training outperforms least-squares training, and peephole units are not useful.
Spoken language understanding using long short-term memory neural networks
TLDR
This paper investigates using long short-term memory (LSTM) neural networks, which contain input, output, and forget gates and are more advanced than simple RNNs, for the word-labeling task, and proposes a regression model on top of the LSTM un-normalized scores to explicitly model output-label dependence.
A generalized LSTM-like training algorithm for second-order recurrent neural networks
…

References

Showing 1-10 of 23 references
Learning to Forget: Continual Prediction with LSTM
TLDR
This work identifies a weakness of LSTM networks processing continual input streams that are not a priori segmented into subsequences with explicitly marked ends at which the network's internal state could be reset, and proposes a novel, adaptive forget gate that enables an LSTM cell to learn to reset itself at appropriate times, thus releasing internal resources.
Learning long-term dependencies with gradient descent is difficult
TLDR
This work shows why gradient-based learning algorithms face an increasingly difficult problem as the duration of the dependencies to be captured increases, and exposes a trade-off between efficient learning by gradient descent and latching onto information for long periods.
Long Short-Term Memory
TLDR
A novel, efficient, gradient based method called long short-term memory (LSTM) is introduced, which can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units.
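A minimal sketch of one LSTM forward step makes the "constant error carousel" concrete: the cell state is updated additively, so error can flow across long time lags. The formulation below is the conventional one with a forget gate (introduced in "Learning to Forget" above; the original 1997 cell effectively fixed it at 1); weight shapes and nonlinearities are assumptions, not the papers' exact parameterization.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Minimal sketch of one LSTM forward step (conventional formulation; an
# assumption, not the papers' exact parameterization).  The cell state c is
# the "constant error carousel": it is updated additively, letting gradients
# flow across long time lags.

def lstm_step(x, h_prev, c_prev, W, b):
    """W: dict of weight matrices and b: dict of biases, keyed by gate name."""
    z = np.concatenate([x, h_prev])
    i = sigmoid(W['i'] @ z + b['i'])   # input gate
    f = sigmoid(W['f'] @ z + b['f'])   # forget gate (can reset the cell)
    o = sigmoid(W['o'] @ z + b['o'])   # output gate
    g = np.tanh(W['g'] @ z + b['g'])   # candidate cell input
    c = f * c_prev + i * g             # additive carousel update
    h = o * np.tanh(c)                 # gated cell output
    return h, c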
Recurrent Neural Networks Can Learn to Implement Symbol-Sensitive Counting
TLDR
This work shows that an RNN can learn a harder CFL, a simple palindrome, by organizing its resources into a symbol-sensitive counting solution, and provides a dynamical systems analysis which demonstrates how the network can not only count, but also copy and store counting information.
Recurrent nets that time and count
  • F. Gers, J. Schmidhuber
    Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks. IJCNN 2000. Neural Computing: New Challenges and Perspectives for the New Millennium
  • 2000
TLDR
Surprisingly, LSTM augmented by "peephole connections" from its internal cells to its multiplicative gates can learn the fine distinction between sequences of spikes separated by either 50 or 49 discrete time steps, without the help of any short training exemplars.
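Peephole connections can be sketched as a small change to the LSTM step above: each multiplicative gate also receives the cell state through a diagonal "peephole" weight, so the gates can time events by inspecting the cell itself. Which state each gate sees (the previous cell state for the input and forget gates, the freshly updated one for the output gate) follows the common formulation and is an assumption here.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hedged sketch: the lstm_step above with diagonal "peephole" weights p that
# let each gate see the cell state directly.  Details (which gate sees which
# cell state) follow the common formulation and are assumptions.

def peephole_lstm_step(x, h_prev, c_prev, W, b, p):
    z = np.concatenate([x, h_prev])
    i = sigmoid(W['i'] @ z + p['i'] * c_prev + b['i'])   # input gate peeks at c_prev
    f = sigmoid(W['f'] @ z + p['f'] * c_prev + b['f'])   # forget gate peeks at c_prev
    g = np.tanh(W['g'] @ z + b['g'])                     # candidate cell input
    c = f * c_prev + i * g                               # carousel update
    o = sigmoid(W['o'] @ z + p['o'] * c + b['o'])        # output gate peeks at new c
    h = o * np.tanh(c)
    return h, c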
A Recurrent Neural Network that Learns to Count
TLDR
This research employs standard backpropagation training techniques for a recurrent neural network in the task of learning to predict the next character in a simple deterministic CFL (DCFL), and shows that an RNN can learn to recognize the structure of a simple DCFL.
Discrete recurrent neural networks for grammatical inference
TLDR
A novel neural architecture for learning deterministic context-free grammars, or equivalently deterministic pushdown automata, is described, along with a composite error function that handles the different situations encountered in learning.
Learning Complex, Extended Sequences Using the Principle of History Compression
TLDR
A simple principle for reducing the descriptions of event sequences without loss of information is introduced, and this insight leads to the construction of neural architectures that learn to divide and conquer by recursively decomposing sequences.
The Dynamics of Discrete-Time Computation, with Application to Recurrent Neural Networks and Finite State Machine Extraction
TLDR
It is shown that an RNN performing a finite state computation must organize its state space to mimic the states in the minimal deterministic finite state machine that can perform that computation, and a precise description of the attractor structure of such systems is given.
Analysis of Dynamical Recognizers
TLDR
This article presents an empirical method for testing whether the language induced by the network is regular, and provides a detailed ε-machine analysis of trained networks for both regular and nonregular languages.
…