Learning Complex, Extended Sequences Using the Principle of History Compression

  • J. Schmidhuber
  • Published 1 March 1992
  • Computer Science
  • Neural Computation
Previous neural network learning algorithms for sequence processing are computationally expensive and perform poorly when it comes to long time lags. This paper first introduces a simple principle for reducing the descriptions of event sequences without loss of information. A consequence of this principle is that only unexpected inputs can be relevant. This insight leads to the construction of neural architectures that learn to divide and conquer by recursively decomposing sequences. I describe… 
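The reduction principle from the abstract can be sketched in a few lines (an illustrative toy, not the paper's neural implementation; the repeat-last predictor is a hypothetical stand-in for a learned predictor): only inputs the predictor fails to anticipate are kept, together with their positions, and the original sequence is recoverable without loss.

```python
def compress(sequence, predict):
    """Keep only the (position, symbol) pairs the predictor gets wrong."""
    reduced = []
    for t, symbol in enumerate(sequence):
        if predict(sequence[:t]) != symbol:   # unexpected input -> relevant
            reduced.append((t, symbol))
    return reduced

def decompress(reduced, length, predict):
    """Reconstruct the full sequence from the reduced description."""
    unexpected = dict(reduced)
    sequence = []
    for t in range(length):
        sequence.append(unexpected.get(t, predict(sequence)))
    return sequence

# Toy predictor: always guess that the previous symbol repeats.
predict_repeat = lambda history: history[-1] if history else None

seq = list("aaabbbbbaacc")
reduced = compress(seq, predict_repeat)       # 4 surprises out of 12 events
```

The better the predictor, the shorter the reduced description a higher level has to process.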
Continuous history compression
A continuous version of history compression is described in which elements are discarded in a graded fashion depending on their predictability, as measured by their (Shannon) information.
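The graded weighting can be made concrete (the probability-to-information mapping is standard; using it as an explicit per-event weight is an illustrative assumption of this sketch):

```python
import math

def information_weight(p):
    """Shannon information -log2(p) of an event the predictor assigned
    probability p: near-certain events weigh ~0 and are effectively
    discarded, while surprising events are retained with large weight."""
    return -math.log2(p)

# Graded version of "only unexpected inputs are relevant":
weights = [information_weight(p) for p in (1.0, 0.5, 0.25)]
```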
Learning Unambiguous Reduced Sequence Descriptions
Experiments show that systems based on these principles can require less computation per time step and many fewer training sequences than conventional training algorithms for recurrent nets.
Variable Computation in Recurrent Neural Networks
A modification to existing recurrent units is explored which allows them to learn to vary the amount of computation they perform at each step, without prior knowledge of the sequence's time structure, which leads to better performance overall on evaluation tasks.
This work discusses a method involving only two recurrent networks which tries to collapse a multi-level predictor hierarchy into a single recurrent net, thus easing supervised or reinforcement learning tasks.
Decoupling Hierarchical Recurrent Neural Networks With Locally Computable Losses
It is empirically shown that in (deep) HRNNs, propagating gradients back from higher to lower levels can be replaced by locally computable losses, without harming the learning capability of the network, over a wide range of tasks.
A Clockwork RNN
This paper introduces a simple, yet powerful modification to the simple RNN architecture, the Clockwork RNN (CW-RNN), in which the hidden layer is partitioned into separate modules, each processing inputs at its own temporal granularity, making computations only at its prescribed clock rate.
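The clocked update schedule can be sketched as follows (module count, sizes, weights, and inputs are toy choices; the paper's constraint that faster modules receive input only from slower ones is omitted for brevity): a module with clock period T recomputes its state only at steps where t % T == 0 and otherwise carries it over unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
periods = [1, 2, 4, 8]      # exponentially spaced clock periods
size = 3                    # units per module (toy choice)
n = size * len(periods)

W = rng.normal(scale=0.1, size=(n, n))

def step(h, x, t):
    """Update only the modules whose clock fires at step t."""
    h_new = h.copy()
    pre = W @ h + x                          # full recurrent pre-activation
    for i, T in enumerate(periods):
        if t % T == 0:                       # module i is active this step
            sl = slice(i * size, (i + 1) * size)
            h_new[sl] = np.tanh(pre[sl])
    return h_new

h = np.zeros(n)
history = []
for t in range(8):
    h = step(h, rng.normal(size=n), t)
    history.append(h.copy())
```

Slow modules thus change rarely, giving the network a cheap long-timescale memory.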
The Vanishing Gradient Problem During Learning Recurrent Neural Nets and Problem Solutions
  • S. Hochreiter
  • Computer Science
    Int. J. Uncertain. Fuzziness Knowl. Based Syst.
  • 1998
The decaying error flow is theoretically analyzed, methods trying to overcome vanishing gradients are briefly discussed, and experiments comparing conventional algorithms and alternative methods are presented.
Hierarchical Recurrent Neural Networks for Long-Term Dependencies
This paper proposes to use a more general type of a-priori knowledge, namely that the temporal dependencies are structured hierarchically, which implies that long-term dependencies are represented by variables with a long time scale.
Eigenvalue Normalized Recurrent Neural Networks for Short Term Memory
This paper proposes an architecture that expands upon an orthogonal/unitary RNN with a state that is generated by a recurrent matrix with eigenvalues in the unit disc, called the Eigenvalue Normalized RNN (ENRNN), which is shown to be highly competitive in several experiments.
Hierarchical Multiscale Recurrent Neural Networks
A novel multiscale approach, the hierarchical multiscale recurrent neural network, is proposed, which can capture the latent hierarchical structure in the sequence by encoding temporal dependencies at different timescales using a novel update mechanism.


Neural Sequence Chunkers
Experiments show that chunking systems can be superior to conventional training algorithms for recurrent nets; the focus is on a class of 2-network systems that try to collapse a self-organizing hierarchy of temporal predictors into a single recurrent network.
Learning to Control Fast-Weight Memories: An Alternative to Dynamic Recurrent Networks
This paper describes an alternative class of gradient-based systems consisting of two feedforward nets that learn to deal with temporal sequences using fast weights: the first net learns to produce context-dependent weight changes for the second net whose weights may vary very quickly.
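A toy rendering of the fast-weight idea (the rank-1 outer-product update and the decay factor used here are illustrative assumptions, not the paper's exact parameterization): one net emits context-dependent changes to the weights of a second net, so the second net's weights vary quickly and act as a short-term memory.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4                                        # width of the fast net (toy)

W_fast = np.zeros((d, d))                    # rapidly varying weights
U = rng.normal(scale=0.5, size=(d, d))       # slow net: "key" projection
V = rng.normal(scale=0.5, size=(d, d))       # slow net: "value" projection
decay = 0.9

def step(x):
    """Slow net emits a rank-1 weight change; fast net then maps x."""
    global W_fast
    key, value = np.tanh(U @ x), np.tanh(V @ x)
    W_fast = decay * W_fast + np.outer(value, key)   # fast weight change
    return np.tanh(W_fast @ x)                       # fast net's output

outputs = [step(rng.normal(size=d)) for _ in range(5)]
```

In training, gradients would flow into U and V (the slow weights); only the inference-time mechanics are sketched here.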
Adaptive Decomposition Of Time
Design principles for the unsupervised detection of regularities (like causal relationships) in temporal sequences are introduced, along with the principles of the first neural sequence chunker, which collapses a self-organizing multi-level predictor hierarchy into a single recurrent network.
Experimental Analysis of the Real-time Recurrent Learning Algorithm
A series of simulation experiments are used to investigate the power and properties of the real-time recurrent learning algorithm, a gradient-following learning algorithm for completely recurrent networks running in continually sampled time.
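Real-time recurrent learning (RTRL) can be sketched for a small tanh RNN (network sizes, and the restriction to sensitivities w.r.t. the recurrent weights W only, are toy simplifications): a tensor P[k, i, j] = ∂h_k/∂W_ij is carried forward alongside the state, which costs O(n³) storage and roughly O(n⁴) time per step, the expense the follow-up algorithms listed here try to reduce.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, T = 3, 2, 10
U = rng.normal(scale=0.5, size=(n, m))       # fixed input weights
xs = rng.normal(size=(T, m))                 # a fixed input sequence
W = rng.normal(scale=0.5, size=(n, n))       # recurrent weights

def run_rtrl(W):
    """Run h_t = tanh(W h_{t-1} + U x_t), carrying the sensitivity tensor."""
    h = np.zeros(n)
    P = np.zeros((n, n, n))                  # P[k, i, j] = dh_k / dW_ij
    for x in xs:
        a = W @ h + U @ x
        h_new = np.tanh(a)
        P_new = np.einsum('kl,lij->kij', W, P)   # chain through h_{t-1}
        for i in range(n):
            P_new[i, i, :] += h              # direct dependence of a_i on W_i:
        P = (1.0 - h_new ** 2)[:, None, None] * P_new
        h = h_new
    return h, P

h, P = run_rtrl(W)
# dL/dW for any loss L(h_T) is then sum_k (dL/dh_k) * P[k].
```

Because P is available at every step, the gradient can be formed online, without storing the past trajectory as backpropagation through time does.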
A Fixed Size Storage O(n³) Time Complexity Learning Algorithm for Fully Recurrent Continually Running Networks
A method suited for on-line learning that computes exactly the same gradient and requires fixed-size storage of the same order but has an average time complexity per time step of O(n³).
An Efficient Gradient-Based Algorithm for On-Line Training of Recurrent Network Trajectories
A novel variant of the familiar backpropagation-through-time approach to training recurrent networks is described. This algorithm is intended to be used on arbitrary recurrent networks that run
Learning Factorial Codes by Predictability Minimization
A novel general principle for the unsupervised learning of distributed, nonredundant internal representations of input patterns, based on two opposing forces, that has the potential to remove not only linear but also nonlinear output redundancy.
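The "two opposing forces" can be illustrated with the predictability score that the predictors minimize and the encoder maximizes (the linear predictors and toy codes below are illustrative assumptions, not the paper's networks): a redundant code is easy to predict unit-from-units, a factorial code is not.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 2000

def predictability(Y):
    """Mean squared error of the best linear predictor of each code unit
    from the remaining units (low = redundant code, high = factorial)."""
    errs = []
    for i in range(Y.shape[1]):
        others = np.delete(Y, i, axis=1)
        X = np.hstack([others, np.ones((N, 1))])     # add an intercept
        coef, *_ = np.linalg.lstsq(X, Y[:, i], rcond=None)
        errs.append(float(np.mean((X @ coef - Y[:, i]) ** 2)))
    return float(np.mean(errs))

z = rng.normal(size=(N, 1))
redundant = np.hstack([z, z])            # unit 2 merely copies unit 1
independent = rng.normal(size=(N, 2))    # a factorial (independent) code

pm_redundant = predictability(redundant)
pm_independent = predictability(independent)
```

In the full scheme the predictors descend on this quantity while the encoder ascends, driving the code units toward statistical independence.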
Learning with Delayed Reinforcement Through Attention-Driven Buffering
  • C. Myers
  • Computer Science, Psychology
    Int. J. Neural Syst.
  • 1991
A method for accomplishing this is presented which buffers a small number of past actions based on the unpredictability of or attention to each as it occurs, which allows for the buffer size to be small, and yet learning can reach indefinitely far back into the past.
Learning to control fast-weight memories: An alternative to recurrent nets
  • Tech. Rep. FKI-147-91, Institut für Informatik, Technische
  • 1991
An O(n³) learning algorithm for fully recurrent networks
  • 1991