Corpus ID: 218596019

Convergence of Online Adaptive and Recurrent Optimization Algorithms

  title={Convergence of Online Adaptive and Recurrent Optimization Algorithms},
  author={Pierre-Yves Massé and Yann Ollivier},
  journal={arXiv: Dynamical Systems},
We prove local convergence of several notable gradient descent algorithms used in machine learning, for which standard stochastic gradient descent theory does not apply. This includes, first, online algorithms for recurrent models and dynamical systems, such as Real-Time Recurrent Learning (RTRL) and its computationally lighter approximations NoBackTrack and UORO; second, several adaptive algorithms such as RMSProp, online natural gradient, and Adam with $\beta^2\to 1$. Despite local…
Prediction of the Position of External Markers Using a Recurrent Neural Network Trained With Unbiased Online Recurrent Optimization for Safe Lung Cancer Radiotherapy
This research uses nine observation records of the three-dimensional positions of three external markers on the chest and abdomen of healthy individuals, recorded during breathing over intervals from 73 s to 222 s, to compare an RNN trained with unbiased online recurrent optimization against an RNN trained with real-time recurrent learning, least mean squares (LMS), and offline linear regression.


On the Convergence of Adam and Beyond
It is shown that one cause for such failures is the exponential moving average used in the algorithms, and suggested that the convergence issues can be fixed by endowing such algorithms with "long-term memory" of past gradients.
Unbiased Online Recurrent Optimization
The novel Unbiased Online Recurrent Optimization (UORO) algorithm allows for online learning of general recurrent computational graphs such as recurrent network models and performs well thanks to the unbiasedness of its gradients.
Adam: A Method for Stochastic Optimization
This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
Training recurrent networks online without backtracking
Preliminary tests on a simple task show that the stochastic approximation of the gradient introduced in the algorithm does not seem to introduce too much noise in the trajectory, compared to maintaining the full gradient, and confirm the good performance and scalability of the Kalman-like version of NoBackTrack.
Approximating Real-Time Recurrent Learning with Random Kronecker Factors
It is shown that KF-RTRL is an unbiased and memory efficient online learning algorithm that captures long-term dependencies and almost matches the performance of TBPTT on real world tasks by training Recurrent Highway Networks on a synthetic string memorization task and on the Penn TreeBank task, respectively.
Gradient calculations for dynamic recurrent neural networks: a survey
The author discusses advantages and disadvantages of temporally continuous neural networks in contrast to clocked ones and presents some "tricks of the trade" for training, using, and simulating continuous time and recurrent neural networks.
Why random reshuffling beats stochastic gradient descent
The convergence rate of the random reshuffling (RR) method is analyzed, and it is shown that when the component functions are quadratics or smooth and the sum function is strongly convex, RR with iterate averaging and a diminishing stepsize converges at rate $\Theta(1/k^{2s})$ with probability one in the suboptimality of the objective value, thus improving upon the $\Omega(1/k)$ rate of SGD.
Curiously Fast Convergence of some Stochastic Gradient Descent Algorithms
Context: Given a finite set of $m$ examples $z_1, \dots, z_m$ and a strictly convex differentiable loss function $\ell(z, \theta)$ defined on a parameter vector $\theta \in \mathbb{R}^d$, we are interested in minimizing the cost
The O.D.E. Method for Convergence of Stochastic Approximation and Reinforcement Learning
It is shown here that stability of the stochastic approximation algorithm is implied by the asymptotic stability of the origin for an associated ODE. This in turn implies convergence of the
Gradient Descent Learns Linear Dynamical Systems
We prove that gradient descent efficiently converges to the global optimizer of the maximum likelihood objective of an unknown linear time-invariant dynamical system from a sequence of noisy