# MomentumRNN: Integrating Momentum into Recurrent Neural Networks

@article{Nguyen2020MomentumRNNIM, title={MomentumRNN: Integrating Momentum into Recurrent Neural Networks}, author={Tan Nguyen and Richard Baraniuk and A. Bertozzi and S. Osher and Baorui Wang}, journal={ArXiv}, year={2020}, volume={abs/2006.06919} }

Designing deep neural networks is an art that often involves an expensive search over candidate architectures. To overcome this for recurrent neural nets (RNNs), we establish a connection between the hidden state dynamics in an RNN and gradient descent (GD). We then integrate momentum into this framework and propose a new family of RNNs, called *MomentumRNNs*. We theoretically prove and numerically demonstrate that MomentumRNNs alleviate the vanishing gradient issue in training RNNs. We…
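The abstract is truncated and does not spell out the update rule, so the following is only a minimal sketch of how a momentum term might be folded into a recurrent cell. The specific recurrence (a velocity `v` that accumulates the input drive `W x`, then feeds the `tanh` update), the names `mu` (momentum coefficient) and `s` (step size), and the helper functions are all assumptions made for illustration, not the authors' reference implementation.

```python
import math


def matvec(M, x):
    """Multiply matrix M (a list of rows) by vector x."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]


def plain_rnn_step(h_prev, x, U, W):
    """Standard recurrent step: h_t = tanh(U h_{t-1} + W x_t)."""
    Uh = matvec(U, h_prev)
    Wx = matvec(W, x)
    return [math.tanh(a + b) for a, b in zip(Uh, Wx)]


def momentum_rnn_step(h_prev, v_prev, x, U, W, mu=0.6, s=1.0):
    """Assumed momentum-augmented step (illustrative only).

    v_t = mu * v_{t-1} + s * W x_t   (velocity accumulates the input drive)
    h_t = tanh(U h_{t-1} + v_t)      (hidden state uses the velocity)
    With mu = 0 and s = 1 this reduces to plain_rnn_step.
    """
    Wx = matvec(W, x)
    v = [mu * vp + s * wx for vp, wx in zip(v_prev, Wx)]
    Uh = matvec(U, h_prev)
    h = [math.tanh(uh + vi) for uh, vi in zip(Uh, v)]
    return h, v


def run_momentum_rnn(xs, U, W, mu=0.6, s=1.0):
    """Unroll the assumed cell over a sequence, starting from zero states."""
    n = len(U)
    h, v = [0.0] * n, [0.0] * n
    for x in xs:
        h, v = momentum_rnn_step(h, v, x, U, W, mu, s)
    return h, v
```

Under these assumptions, setting `mu=0.0` and `s=1.0` recovers an ordinary `tanh` RNN, which makes the momentum variant easy to compare against a baseline on the same weights.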

## 8 Citations

AIR-Net: Adaptive and Implicit Regularization Neural Network for Matrix Completion

- Computer Science, Mathematics
- ArXiv
- 2021

Theoretically, it is shown that the adaptive regularization of AIR-Net enhances the implicit regularization and vanishes at the end of training. The model's effectiveness is validated on various benchmark tasks, indicating that AIR-Net is particularly favorable for scenarios in which the missing entries are non-uniform.

Attention network forecasts time‐to‐failure in laboratory shear experiments

- Physics, Computer Science
- Journal of Geophysical Research: Solid Earth
- 2021

A method that applies unsupervised classification and an attention network to acoustic emission (AE) waveform features to forecast labquakes (synthetic earthquakes made in a laboratory), combining machine learning with expert knowledge about earthquake formation.

Deep Incremental RNN for Learning Sequential Data: A Lyapunov Stable Dynamical System

- 2021

With the recent advances in mobile sensing technologies, large amounts of sequential data are collected, such as vehicle GPS records, stock prices, sensor data from air quality detectors. Recurrent…

Heavy Ball Neural Ordinary Differential Equations

- Computer Science, Mathematics
- ArXiv
- 2021

This work proposes heavy ball neural ordinary differential equations (HBNODEs), leveraging the continuous limit of the classical momentum accelerated gradient descent, to improve neural ODEs (NODEs) training and inference, and verifies the advantages of HBNODEs over NODEs on benchmark tasks, including image classification, learning complex dynamics, and sequential modeling.

How Does Momentum Benefit Deep Neural Networks Architecture Design? A Few Case Studies

- Computer Science, Mathematics
- ArXiv
- 2021

It is shown that integrating momentum into neural network architectures has several remarkable theoretical and empirical benefits, including overcoming the vanishing gradient issue in training RNNs and neural ODEs, resulting in effective learning of long-term dependencies.

Lipschitz Recurrent Neural Networks

- Computer Science, Mathematics
- ICLR
- 2021

This work proposes a recurrent unit that describes the hidden state's evolution with two parts: a well-understood linear component plus a Lipschitz nonlinearity, which is more robust with respect to input and parameter perturbations as compared to other continuous-time RNNs.

Momentum Residual Neural Networks

- Computer Science, Mathematics
- ICML
- 2021

This paper proposes to change the forward rule of a ResNet by adding a momentum term. The resulting networks, Momentum ResNets, are invertible and can be used as a drop-in replacement for any existing ResNet block.
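As a rough illustration of why adding a momentum term can make a residual update invertible, the sketch below uses one plausible form of the rule: a velocity `v` that is a `gamma`-weighted average of its previous value and the residual function, added to the activation `x`. The exact rule, the coefficient name `gamma`, and the stand-in residual function `f` are assumptions for this example, not details taken from the summary above.

```python
import math


def f(x):
    """Stand-in residual function (hypothetical); any function of x works."""
    return [math.tanh(2.0 * xi) for xi in x]


def forward(x, v, gamma=0.9):
    """Assumed momentum residual step.

    v_next = gamma * v + (1 - gamma) * f(x)
    x_next = x + v_next
    """
    fx = f(x)
    v_next = [gamma * vi + (1.0 - gamma) * fi for vi, fi in zip(v, fx)]
    x_next = [xi + vi for xi, vi in zip(x, v_next)]
    return x_next, v_next


def inverse(x_next, v_next, gamma=0.9):
    """Undo forward() algebraically, without inverting f itself."""
    x = [xn - vn for xn, vn in zip(x_next, v_next)]
    fx = f(x)
    v = [(vn - (1.0 - gamma) * fi) / gamma for vn, fi in zip(v_next, fx)]
    return x, v


def roundtrip(x0, v0, depth=5, gamma=0.9):
    """Run `depth` forward steps, then `depth` inverse steps."""
    x, v = list(x0), list(v0)
    for _ in range(depth):
        x, v = forward(x, v, gamma)
    for _ in range(depth):
        x, v = inverse(x, v, gamma)
    return x, v
```

Because the inverse only evaluates `f` and never inverts it, the input can be reconstructed from the output up to floating-point error for any nonzero `gamma`, which is the property that lets activations be recomputed rather than stored.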

SBO-RNN: Reformulating Recurrent Neural Networks via Stochastic Bilevel Optimization

- 2021

In this paper we consider the training stability of recurrent neural networks (RNNs), and propose a family of RNNs, namely SBO-RNN, that can be formulated using stochastic bilevel optimization (SBO).…

## References

SHOWING 1-10 OF 66 REFERENCES

On the importance of initialization and momentum in deep learning

- Computer Science
- ICML
- 2013

It is shown that when stochastic gradient descent with momentum uses a well-designed random initialization and a particular type of slowly increasing schedule for the momentum parameter, it can train both DNNs and RNNs to levels of performance that were previously achievable only with Hessian-Free optimization.

Independently Recurrent Neural Network (IndRNN): Building A Longer and Deeper RNN

- Computer Science
- 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition
- 2018

It is shown that an IndRNN can be easily regulated to prevent the exploding and vanishing gradient problems while allowing the network to learn long-term dependencies, and that it can work with non-saturated activation functions such as ReLU while still being trained robustly.

Scheduled Restart Momentum for Accelerated Stochastic Gradient Descent

- Computer Science, Mathematics
- ArXiv
- 2020

SRSGD, a new NAG-style scheme for training DNNs, replaces the constant momentum in SGD with the increasing momentum of NAG but stabilizes the iterations by resetting the momentum to zero according to a schedule.

RNNs Evolving in Equilibrium: A Solution to the Vanishing and Exploding Gradients

- Physics, Computer Science
- ArXiv
- 2019

This work proposes a family of novel RNNs, namely Equilibrated Recurrent Neural Networks (ERNNs), that overcome the gradient decay or explosion effect and lead to recurrent models that evolve on the equilibrium manifold.

RNNs Evolving on an Equilibrium Manifold: A Panacea for Vanishing and Exploding Gradients?

- Computer Science, Mathematics
- 2019

This work proposes a family of novel RNNs, namely Equilibrated Recurrent Neural Networks (ERNNs), that overcome the gradient decay or explosion effect and lead to recurrent models that evolve on the equilibrium manifold.

Improving performance of recurrent neural network with relu nonlinearity

- Computer Science
- ArXiv
- 2015

This paper offers a simple dynamical systems perspective on the weight initialization process, which allows for a modified weight initialization strategy, and shows that this initialization technique leads to successful training of RNNs composed of ReLUs.

Regularizing and Optimizing LSTM Language Models

- Computer Science
- ICLR
- 2018

This paper proposes the weight-dropped LSTM which uses DropConnect on hidden-to-hidden weights as a form of recurrent regularization and introduces NT-ASGD, a variant of the averaged stochastic gradient method, wherein the averaging trigger is determined using a non-monotonic condition as opposed to being tuned by the user.

Unitary Evolution Recurrent Neural Networks

- Computer Science, Mathematics
- ICML
- 2016

This work constructs an expressive unitary weight matrix by composing several structured matrices that act as building blocks with parameters to be learned, and demonstrates the potential of this architecture by achieving state-of-the-art results in several hard tasks involving very long-term dependencies.

Advances in optimizing recurrent networks

- Computer Science
- 2013 IEEE International Conference on Acoustics, Speech and Signal Processing
- 2013

Experiments reported here evaluate the use of clipping gradients, spanning longer time ranges with leaky integration, advanced momentum techniques, using more powerful output probability models, and encouraging sparser gradients to help symmetry breaking and credit assignment.

Recurrent Neural Networks in the Eye of Differential Equations

- Computer Science, Physics
- ArXiv
- 2019

It is proved that popular RNN architectures, such as LSTM and URNN, fit into different orders of $n$-$t$-ODERNNs and it is shown that the degree of RNN's functional nonlinearity and the range of its temporal memory can be mapped to the corresponding stage of Runge-Kutta recursion and the order of time-derivative of the ODEs.