Corpus ID: 234482858

Slower is Better: Revisiting the Forgetting Mechanism in LSTM for Slower Information Decay

@article{Chien2021SlowerIB,
  title={Slower is Better: Revisiting the Forgetting Mechanism in LSTM for Slower Information Decay},
  author={Hsiang-Yun Sherry Chien and Javier Turek and Nicole M. Beckage and Vy A. Vo and Christopher John Honey and Ted L. Willke},
  journal={ArXiv},
  year={2021},
  volume={abs/2105.05944}
}
Sequential information contains short- to long-range dependencies; however, learning long-timescale information has been a challenge for recurrent neural networks. Despite improvements in long short-term memory networks (LSTMs), the forgetting mechanism results in the exponential decay of information, limiting their capacity to capture long-timescale information. Here, we propose a power law forget gate, which instead learns to forget information along a slower power law decay function… 
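As a rough illustration of the idea (a minimal sketch, not the authors' exact formulation), a per-step forget multiplier of ((Δt + 1)/Δt)^(−p) makes the retained cell state fall off as (Δt + 1)^(−p), a power law in the elapsed time Δt, whereas a constant gate value below one yields exponential decay. The exponent p below is a hypothetical learned parameter.

```python
import numpy as np

def power_law_forget(delta_t, p):
    """Per-step forget multiplier whose running product equals (delta_t + 1) ** -p.

    Multiplying the cell state by ((delta_t + 1) / delta_t) ** -p at every step
    makes memory written delta_t steps ago decay as a power law of elapsed time,
    instead of the exponential decay produced by a constant gate value.
    """
    return ((delta_t + 1.0) / delta_t) ** (-p)

# Compare how one unit of cell state decays under the two mechanisms.
T = 1000
p = 0.5            # hypothetical power-law exponent (learned in the paper's model)
f_const = 0.99     # a typical, already large, constant forget-gate value

c_power, c_exp = 1.0, 1.0
for t in range(1, T + 1):
    c_power *= power_law_forget(t, p)   # -> (T + 1) ** -p, about 0.032 at T = 1000
    c_exp *= f_const                    # -> 0.99 ** T, about 4e-5 at T = 1000

print(f"power-law retention after {T} steps: {c_power:.4f}")
print(f"exponential retention after {T} steps: {c_exp:.6f}")
```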

References

SHOWING 1-10 OF 31 REFERENCES

Multi-timescale representation learning in LSTM Language Models

TLDR
This work constructs explicitly multi-timescale language models by manipulating the input and forget gate biases in a long short-term memory (LSTM) network, and empirically analyzes the timescale of information routed through each part of the model using word ablation experiments and forget gate visualizations.
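The mapping below sketches how a forget-gate bias sets a memory timescale, assuming a roughly constant gate value f = sigmoid(b_f) so that information decays as f^t with timescale about 1/(1 − f). The helper function and the log-spaced timescales are illustrative, not the paper's exact recipe.

```python
import numpy as np

def forget_bias_for_timescale(tau):
    """Forget-gate bias whose sigmoid gives a decay timescale of roughly tau steps.

    With an approximately constant forget gate f = sigmoid(b_f), the cell state
    decays as f**t, i.e. with characteristic timescale tau ~= 1 / (1 - f).
    Inverting gives b_f = log(tau - 1).
    """
    return np.log(np.asarray(tau, dtype=float) - 1.0)

# Assign a spread of timescales across the hidden units of one LSTM layer.
# Log-spaced values are used here for illustration; the paper draws timescales
# from a specific distribution that this sketch does not reproduce.
hidden_size = 8
timescales = np.logspace(np.log10(2), np.log10(200), hidden_size)
b_forget = forget_bias_for_timescale(timescales)
b_input = -b_forget   # complementary input-gate bias (chrono-style); the paper's exact choice may differ

print(np.round(timescales, 1))
print(np.round(b_forget, 2))
```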

Long Short-Term Memory

TLDR
A novel, efficient, gradient based method called long short-term memory (LSTM) is introduced, which can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units.
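For reference, a minimal LSTM cell step is sketched below (illustrative weight layout, not tied to any particular library's parameter ordering). The multiplicative forget gate acting on the cell state is the mechanism the main paper revisits, since applying a gate value below one at every step produces exponential decay.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One step of a standard LSTM cell.

    W: (4H, D) input weights, U: (4H, H) recurrent weights, b: (4H,) biases,
    stacked in the order [input gate, forget gate, cell candidate, output gate].
    """
    H = h_prev.shape[0]
    z = W @ x + U @ h_prev + b
    i = sigmoid(z[0:H])          # input gate
    f = sigmoid(z[H:2 * H])      # forget gate: repeated multiplication by f < 1
                                 # is what produces exponential information decay
    g = np.tanh(z[2 * H:3 * H])  # candidate cell update
    o = sigmoid(z[3 * H:4 * H])  # output gate
    c = f * c_prev + i * g       # cell state (the "constant error carousel")
    h = o * np.tanh(c)
    return h, c

# Tiny usage example with random parameters.
rng = np.random.default_rng(0)
D, H = 3, 4
W, U, b = rng.normal(size=(4 * H, D)), rng.normal(size=(4 * H, H)), np.zeros(4 * H)
h, c = np.zeros(H), np.zeros(H)
for x in rng.normal(size=(5, D)):
    h, c = lstm_step(x, h, c, W, U, b)
print(h)
```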

Gated Orthogonal Recurrent Units: On Learning to Forget

We present a novel recurrent neural network (RNN)–based model that combines the remembering ability of unitary evolution RNNs with the ability of gated RNNs to effectively forget redundant or irrelevant information in its memory.

Towards Non-saturating Recurrent Units for Modelling Long-term Dependencies

TLDR
This work proposes a new recurrent architecture (Non-saturating Recurrent Unit; NRU) that relies on a memory mechanism but forgoes both saturating activation functions and saturating gates, in order to further alleviate vanishing gradients.

Improving the Gating Mechanism of Recurrent Neural Networks

TLDR
Two synergistic modifications to the standard gating mechanism, which are easy to implement, introduce no additional hyperparameters, and improve learnability of the gates when they are close to saturation, are shown to robustly improve the performance of recurrent models on a range of applications.

The unreasonable effectiveness of the forget gate

TLDR
This work shows that a forget-gate-only version of the LSTM with chrono-initialized biases not only provides computational savings but also outperforms the standard LSTM on multiple benchmark datasets and competes with some of the best contemporary models.
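Below is a sketch of the two ingredients named here, under simplifying assumptions (this is not the published cell's exact set of equations): a chrono-style forget-gate bias initialization, and a single-gate recurrent update in which the forget gate alone mixes the previous state with a candidate.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def chrono_forget_bias(hidden_size, t_max, rng):
    """Chrono-style forget-gate bias: b_f = log(u), u ~ Uniform(1, t_max - 1).

    This spreads the initial memory timescales of the units roughly uniformly
    up to t_max steps.
    """
    return np.log(rng.uniform(1.0, t_max - 1.0, size=hidden_size))

def forget_only_step(x, h_prev, params):
    """One step of a simplified forget-gate-only recurrent cell (a sketch, not
    the cited model's exact equations): the new state is a convex combination of
    the previous state and a tanh candidate, controlled by a single forget gate.
    """
    Uf, Wf, bf, Uh, Wh, bh = params
    f = sigmoid(Uf @ x + Wf @ h_prev + bf)
    cand = np.tanh(Uh @ x + Wh @ h_prev + bh)
    return f * h_prev + (1.0 - f) * cand

rng = np.random.default_rng(0)
D, H, T_MAX = 3, 4, 100
params = (rng.normal(size=(H, D)), rng.normal(size=(H, H)),
          chrono_forget_bias(H, T_MAX, rng),
          rng.normal(size=(H, D)), rng.normal(size=(H, H)), np.zeros(H))
h = np.zeros(H)
for x in rng.normal(size=(5, D)):
    h = forget_only_step(x, h, params)
print(h)
```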

Assessing the Ability of LSTMs to Learn Syntax-Sensitive Dependencies

TLDR
It is concluded that LSTMs can capture a non-trivial amount of grammatical structure given targeted supervision, but stronger architectures may be required to further reduce errors; furthermore, the language modeling signal is insufficient for capturing syntax-sensitive dependencies, and should be supplemented with more direct supervision if such dependencies need to be captured.

Phased LSTM: Accelerating Recurrent Network Training for Long or Event-based Sequences

TLDR
This work introduces the Phased LSTM model, which extends the LSTM unit by adding a new time gate, controlled by a parametrized oscillation with a frequency range that requires updates of the memory cell only during a small percentage of the cycle.
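An illustrative sketch of an oscillatory time gate of this kind follows; the parameter names (tau, s, r_on, alpha) stand for period, phase shift, open ratio, and leak, but this is not a complete Phased LSTM cell and its details may differ from the published gate.

```python
import numpy as np

def phased_time_gate(t, tau, s, r_on, alpha=1e-3):
    """Openness of an oscillatory time gate at (possibly continuous) time t.

    phi is the position within a cycle of period tau and phase shift s; the
    gate ramps open during the first half of the open fraction r_on, ramps
    closed during the second half, and otherwise leaks with small slope alpha.
    """
    phi = np.mod(t - s, tau) / tau
    k = np.where(phi < 0.5 * r_on, 2.0 * phi / r_on,
        np.where(phi < r_on, 2.0 - 2.0 * phi / r_on, alpha * phi))
    return k

# Units with different periods see the same timestamps but open at different
# times, so each attends to a different timescale of the input stream.
timestamps = np.arange(0.0, 10.0, 0.5)
for tau in (2.0, 5.0, 10.0):
    print(tau, np.round(phased_time_gate(timestamps, tau, s=0.0, r_on=0.2), 3))
```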

Learning Longer Memory in Recurrent Neural Networks

TLDR
This paper shows that learning longer-term patterns in real data, such as natural language, is perfectly possible using gradient descent, through a slight structural modification of the simple recurrent neural network architecture.
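A sketch in the spirit of that modification is given below (the exact published architecture may differ): a simple recurrent step augmented with context units that retain their state with a fixed factor close to one, so they integrate information over much longer horizons than the fast hidden units.

```python
import numpy as np

def slow_context_step(x, h_prev, s_prev, params, alpha=0.95):
    """One step of a simple RNN augmented with slowly-changing context units.

    The context state s integrates the input with a fixed, near-one retention
    factor alpha, giving it a much longer effective memory than the fast
    sigmoid hidden state h, which it feeds into.
    """
    B, A, P, R = params
    s = alpha * s_prev + (1.0 - alpha) * (B @ x)                   # slow context units
    h = 1.0 / (1.0 + np.exp(-(A @ x + P @ s + R @ h_prev)))        # fast hidden units
    return h, s

rng = np.random.default_rng(0)
D, H, S = 3, 4, 2
params = (rng.normal(size=(S, D)), rng.normal(size=(H, D)),
          rng.normal(size=(H, S)), rng.normal(size=(H, H)))
h, s = np.zeros(H), np.zeros(S)
for x in rng.normal(size=(5, D)):
    h, s = slow_context_step(x, h, s, params)
print(h, s)
```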

Legendre Memory Units: Continuous-Time Representation in Recurrent Neural Networks

TLDR
Backpropagation through the ODE solver allows each layer to adapt its internal time-step, enabling the network to learn task-relevant time-scales and exceed state-of-the-art performance among RNNs on permuted sequential MNIST.