On the importance of initialization and momentum in deep learning


Deep and recurrent neural networks (DNNs and RNNs, respectively) are powerful models that were considered to be almost impossible to train using stochastic gradient descent with momentum. In this paper, we show that when stochastic gradient descent with momentum uses a well-designed random initialization and a particular type of slowly increasing schedule for the momentum parameter, it can train both DNNs and RNNs (on datasets with long-term dependencies) to levels of performance that were previously achievable only with Hessian-Free optimization. We find that both the initialization and the momentum are crucial, since poorly initialized networks cannot be trained with momentum and well-initialized networks perform markedly worse when the momentum is absent or poorly tuned. Our success in training these models suggests that previous attempts to train deep and recurrent neural networks from random initializations have likely failed due to poor initialization schemes. Furthermore, carefully tuned momentum methods suffice for dealing with the curvature issues in deep and recurrent network training objectives without the need for sophisticated second-order methods.
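
The abstract does not spell out the momentum schedule or the update rule, so the following is only a minimal sketch of the general idea: classical momentum whose coefficient rises slowly from 0.5 toward a cap. The schedule formula, the `period` of 250 steps, the cap `mu_max`, the learning rate, and the toy quadratic objective are all illustrative assumptions, not values taken from the text above.

```python
import math


def momentum_schedule(t, mu_max=0.99, period=250):
    """Hypothetical slowly increasing momentum coefficient, capped at mu_max.

    Steps 0..period-1 give 0.5, the next period gives 0.75, and so on.
    """
    mu = 1.0 - 2.0 ** (-1.0 - math.log2(t // period + 1))
    return min(mu, mu_max)


def sgd_momentum(grad, theta, lr=0.01, steps=1000, mu_max=0.99):
    """Classical momentum: v <- mu * v - lr * grad(theta); theta <- theta + v."""
    v = [0.0] * len(theta)
    for t in range(steps):
        mu = momentum_schedule(t, mu_max)
        g = grad(theta)
        v = [mu * vi - lr * gi for vi, gi in zip(v, g)]
        theta = [ti + vi for ti, vi in zip(theta, v)]
    return theta


def quad_grad(theta):
    """Gradient of the toy objective f(x, y) = 0.5 * (x**2 + 100 * y**2)."""
    x, y = theta
    return [x, 100.0 * y]


# Assumed starting point; a real network would use a random initialization.
final = sgd_momentum(quad_grad, [1.0, 1.0])
print(final)
```

The toy objective is deliberately ill-conditioned (curvature 1 in one direction, 100 in the other) to mimic, very loosely, the curvature issues the abstract mentions. It says nothing about how the method behaves on actual deep or recurrent networks.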
