Corpus ID: 204401713

Decoupling Hierarchical Recurrent Neural Networks With Locally Computable Losses

@article{Mujika2019DecouplingHR,
  title={Decoupling Hierarchical Recurrent Neural Networks With Locally Computable Losses},
  author={Asier Mujika and Felix Weissenberger and A. Steger},
  journal={ArXiv},
  year={2019},
  volume={abs/1910.05245}
}
Learning long-term dependencies is a key long-standing challenge of recurrent neural networks (RNNs). Hierarchical recurrent neural networks (HRNNs) have been considered a promising approach as long-term dependencies are resolved through shortcuts up and down the hierarchy. Yet, the memory requirements of Truncated Backpropagation Through Time (TBPTT) still prevent training them on very long sequences. In this paper, we empirically show that in (deep) HRNNs, propagating gradients back from…
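
As a rough illustration of the general idea (a sketch only, not the paper's actual method, which the truncated abstract does not spell out): a hierarchical RNN can be decoupled by detaching the states exchanged between levels and giving each level its own locally computable loss, so that no gradient needs to be propagated through the full hierarchy. The module names, the choice of LSTM cells, and the particular local losses below are all assumptions made for the example.

```python
# Minimal sketch (not the paper's exact architecture): a two-level hierarchical RNN
# in which the levels are decoupled by detaching the states exchanged between them,
# so each level is trained with its own locally computable loss instead of full
# backpropagation through the whole hierarchy. All names and losses are illustrative.
import torch
import torch.nn as nn

class TwoLevelHRNN(nn.Module):
    def __init__(self, in_dim, hid_dim, k=4):
        super().__init__()
        self.k = k                                          # upper level ticks every k steps
        self.lower = nn.LSTMCell(in_dim + hid_dim, hid_dim)
        self.upper = nn.LSTMCell(hid_dim, hid_dim)
        self.readout = nn.Linear(hid_dim, in_dim)           # local head: predict next input
        self.upper_readout = nn.Linear(hid_dim, hid_dim)    # local head: predict next summary

    def forward(self, xs):
        # xs: (T, batch, in_dim)
        T, B, _ = xs.shape
        h_l = c_l = xs.new_zeros(B, self.lower.hidden_size)
        h_u = c_u = xs.new_zeros(B, self.upper.hidden_size)
        local_loss = xs.new_zeros(())
        for t in range(T):
            # top-down context is detached: no gradient flows into the upper level
            inp = torch.cat([xs[t], h_u.detach()], dim=-1)
            h_l, c_l = self.lower(inp, (h_l, c_l))
            if t + 1 < T:
                # lower-level local loss: next-step prediction of the input
                local_loss = local_loss + ((self.readout(h_l) - xs[t + 1]) ** 2).mean()
            if (t + 1) % self.k == 0:
                # bottom-up summary is detached as well; the upper level gets its own
                # local loss (here: predicting the next lower-level summary)
                summary = h_l.detach()
                local_loss = local_loss + ((self.upper_readout(h_u) - summary) ** 2).mean()
                h_u, c_u = self.upper(summary, (h_u, c_u))
        return local_loss
```

A training step would simply call `loss = model(xs); loss.backward()`; because of the `detach()` calls, no gradient crosses between the two levels, which is what keeps the per-level memory cost of truncated backpropagation small.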

References

Showing 1-10 of 35 references
Learning long-term dependencies with gradient descent is difficult
TL;DR: This work shows why gradient-based learning algorithms face an increasingly difficult problem as the duration of the dependencies to be captured increases, and exposes a trade-off between efficient learning by gradient descent and latching on to information for long periods.
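
The trade-off described above can be made concrete with a toy computation (illustrative only, not taken from either paper): in a vanilla tanh RNN, the Jacobian of the hidden state at time t with respect to the state at time 0 is a product of t per-step Jacobians, so its norm typically shrinks (or explodes) roughly exponentially with t.

```python
# Toy demonstration of vanishing gradients in a vanilla tanh RNN: the running
# product of per-step Jacobians d h_t / d h_0 decays rapidly with t.
import numpy as np

rng = np.random.default_rng(0)
n = 32
W = rng.normal(scale=1.0 / np.sqrt(n), size=(n, n))   # recurrent weights
h = rng.normal(size=n)
J = np.eye(n)                                         # running Jacobian d h_t / d h_0
for t in range(1, 101):
    h = np.tanh(W @ h)
    J = np.diag(1.0 - h ** 2) @ W @ J                 # chain rule: one more step
    if t % 20 == 0:
        print(f"t={t:3d}  ||d h_t / d h_0|| = {np.linalg.norm(J):.3e}")
```
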
A Clockwork RNN
TL;DR: This paper introduces a simple yet powerful modification to the simple RNN architecture, the Clockwork RNN (CW-RNN), in which the hidden layer is partitioned into separate modules, each processing inputs at its own temporal granularity and making computations only at its prescribed clock rate.
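
A minimal sketch of the clocked-update rule just described (simplified: biases, input preprocessing, output layer, and training are omitted; module sizes and periods are arbitrary choices for the example). Modules have exponentially increasing periods, only modules whose period divides the current step are updated, and each module reads recurrently only from modules running at its own rate or slower.

```python
# Sketch of the Clockwork RNN update rule: hidden units are split into modules
# with increasing clock periods; at step t only the modules whose period divides t
# are updated, and each module receives recurrent input only from modules that
# run at its own rate or slower.
import numpy as np

rng = np.random.default_rng(0)
periods = [1, 2, 4, 8]          # clock periods, one per module
m = 8                           # units per module
n = m * len(periods)
W_h = rng.normal(scale=0.1, size=(n, n))   # recurrent weights
W_x = rng.normal(scale=0.1, size=(n, 3))   # input weights (3-dim input)

# mask out connections from faster modules to slower ones
mask = np.zeros((n, n))
for i, Ti in enumerate(periods):
    for j, Tj in enumerate(periods):
        if Tj >= Ti:            # module i may read from module j only if j is not faster
            mask[i * m:(i + 1) * m, j * m:(j + 1) * m] = 1.0
W_h *= mask

def step(h, x, t):
    h_new = np.tanh(W_h @ h + W_x @ x)
    active = np.concatenate([np.full(m, t % Ti == 0) for Ti in periods])
    return np.where(active, h_new, h)      # inactive modules keep their old state

h = np.zeros(n)
for t in range(16):
    h = step(h, rng.normal(size=3), t)
```
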
Hierarchical Recurrent Neural Networks for Long-Term Dependencies
TL;DR: This paper proposes to use a more general type of a priori knowledge, namely that temporal dependencies are structured hierarchically, which implies that long-term dependencies are represented by variables with a long time scale.
Learning Recurrent Neural Networks with Hessian-Free Optimization
TL;DR: This work addresses the long-standing problem of how to effectively train recurrent neural networks on complex and difficult sequence-modeling problems that may contain long-term data dependencies, and offers a new interpretation of Schraudolph's generalized Gauss-Newton matrix, which is used within the Hessian-free (HF) approach of Martens.
Hierarchical Multiscale Recurrent Neural Networks
TL;DR: A novel multiscale approach, the hierarchical multiscale recurrent neural network, is proposed; it can capture latent hierarchical structure in a sequence by encoding temporal dependencies at different timescales using a novel update mechanism.
Long Short-Term Memory
TL;DR: A novel, efficient, gradient-based method called long short-term memory (LSTM) is introduced, which can learn to bridge minimal time lags in excess of 1000 discrete time steps by enforcing constant error flow through constant error carousels within special units.
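
The "constant error carousel" can be written out in a few lines. The sketch below uses the modern gated formulation with a forget gate (which the original 1997 paper did not yet have); the additive update of the cell state c is what allows error to flow back over many time steps. Dimensions and weight initialisation are arbitrary choices for the example.

```python
# One step of a gated LSTM cell in NumPy, highlighting the additive
# "constant error carousel" update of the cell state c.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, U, b):
    # W: (4*d, x_dim), U: (4*d, d), b: (4*d,); rows ordered [i, f, g, o]
    d = h.shape[0]
    z = W @ x + U @ h + b
    i = sigmoid(z[0 * d:1 * d])          # input gate
    f = sigmoid(z[1 * d:2 * d])          # forget gate
    g = np.tanh(z[2 * d:3 * d])          # candidate cell update
    o = sigmoid(z[3 * d:4 * d])          # output gate
    c_new = f * c + i * g                # additive carousel update of the cell state
    h_new = o * np.tanh(c_new)
    return h_new, c_new

d, x_dim = 4, 3
rng = np.random.default_rng(0)
W = rng.normal(size=(4 * d, x_dim))
U = rng.normal(size=(4 * d, d))
b = np.zeros(4 * d)
h, c = np.zeros(d), np.zeros(d)
h, c = lstm_step(rng.normal(size=x_dim), h, c, W, U, b)
```
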
Optimal Kronecker-Sum Approximation of Real Time Recurrent Learning
TL;DR: A new approximation algorithm for RTRL, the Optimal Kronecker-Sum Approximation (OK), is presented, and it is proved that OK is optimal within a class of approximations of RTRL that includes all approaches published so far.
Learning Complex, Extended Sequences Using the Principle of History Compression
TL;DR: A simple principle for reducing the descriptions of event sequences without loss of information is introduced, and this insight leads to the construction of neural architectures that learn to divide and conquer by recursively decomposing sequences.
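
A bare-bones rendering of the history-compression principle (illustrative only): a lower-level predictor consumes the full sequence, and only the symbols it fails to predict are passed up to the next level, which therefore sees a much shorter, compressed description of the same history. The trivial previous-symbol frequency predictor below stands in for the lower-level network.

```python
# History compression in miniature: pass upward only the symbols the
# lower-level predictor got wrong ("unexpected" events).
from collections import defaultdict

def compress(sequence):
    """Return the sub-sequence of (position, symbol) pairs the predictor mispredicts."""
    counts = defaultdict(lambda: defaultdict(int))   # counts[prev][next]
    prev = None
    unexpected = []
    for pos, sym in enumerate(sequence):
        if prev is None:
            unexpected.append((pos, sym))            # nothing to predict from yet
        else:
            table = counts[prev]
            guess = max(table, key=table.get) if table else None
            if guess != sym:                         # misprediction -> pass upward
                unexpected.append((pos, sym))
            table[sym] += 1
        prev = sym
    return unexpected

print(compress("abababababacabababab"))
# -> [(0, 'a'), (1, 'b'), (2, 'a'), (11, 'c'), (12, 'a')]
```
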
Z-Forcing: Training Stochastic Recurrent Networks
TL;DR: This work unifies successful ideas from recently proposed architectures into a stochastic recurrent model that achieves state-of-the-art results on standard speech benchmarks such as TIMIT and Blizzard and competitive performance on sequential MNIST.
The Vanishing Gradient Problem During Learning Recurrent Neural Nets and Problem Solutions
  • S. Hochreiter, Int. J. Uncertain. Fuzziness Knowl. Based Syst., 1998
TL;DR: The decaying error flow is theoretically analyzed, methods for overcoming vanishing gradients are briefly discussed, and experiments comparing conventional algorithms and alternative methods are presented.