Corpus ID: 1770217

Accelerating SGD for Distributed Deep-Learning Using Approximated Hessian Matrix

Sébastien M. R. Arnold and Chunming Wang
We introduce a novel method to compute a rank $m$ approximation of the inverse of the Hessian matrix in the distributed regime. By leveraging the differences in gradients and parameters across multiple workers, we efficiently implement a distributed approximation of the Newton-Raphson method. We also present preliminary results that underline the advantages and challenges of second-order methods for large stochastic optimization problems. In particular, our work suggests that novel…
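The abstract does not give the update rule, but a rank-$m$ inverse-Hessian approximation built from parameter differences $s_i$ and gradient differences $y_i$ is commonly realized with an L-BFGS-style two-loop recursion. The sketch below is an illustration of that general idea, not the paper's actual algorithm; the function name and the choice of two-loop recursion are assumptions.

```python
import numpy as np

def lbfgs_direction(grad, s_list, y_list):
    """Apply a rank-m inverse-Hessian approximation to `grad`.

    Illustrative sketch (not the paper's method): the standard L-BFGS
    two-loop recursion, where s_i = x_{i+1} - x_i are parameter
    differences and y_i = g_{i+1} - g_i are gradient differences
    (e.g. gathered from different workers).
    """
    q = grad.astype(float).copy()
    rhos = [1.0 / np.dot(y, s) for s, y in zip(s_list, y_list)]

    # First loop: newest pair to oldest.
    alphas = []
    for s, y, rho in reversed(list(zip(s_list, y_list, rhos))):
        alpha = rho * np.dot(s, q)
        q -= alpha * y
        alphas.append(alpha)

    # Initial Hessian scaling gamma = s^T y / y^T y (most recent pair).
    s, y = s_list[-1], y_list[-1]
    q *= np.dot(s, y) / np.dot(y, y)

    # Second loop: oldest pair to newest.
    for (s, y, rho), alpha in zip(zip(s_list, y_list, rhos),
                                  reversed(alphas)):
        beta = rho * np.dot(y, q)
        q += (alpha - beta) * s
    return q  # approximates H^{-1} @ grad
```

A quasi-Newton step would then be `x -= lr * lbfgs_direction(g, s_list, y_list)`. By construction the approximation satisfies the secant equation for the most recent pair: applying it to `y_list[-1]` returns `s_list[-1]`.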


Large Scale Distributed Deep Networks
This paper considers the problem of training a deep network with billions of parameters using tens of thousands of CPU cores and develops two algorithms for large-scale distributed training, Downpour SGD and Sandblaster L-BFGS, which increase the scale and speed of deep network training.
Asynchrony begets momentum, with an application to deep learning
It is shown that running stochastic gradient descent in an asynchronous manner can be viewed as adding a momentum-like term to the SGD iteration, and an important implication is that tuning the momentum parameter is important when considering different levels of asynchrony.
Model Accuracy and Runtime Tradeoff in Distributed Deep Learning: A Systematic Study
This paper presents Rudra, a parameter-server-based distributed computing framework tuned for training large-scale deep neural networks; it introduces a new learning-rate modulation strategy to counter the effect of stale gradients and proposes a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance, and achieve good model accuracy.
Identifying and attacking the saddle point problem in high-dimensional non-convex optimization
This paper proposes a new approach to second-order optimization, the saddle-free Newton method, that can rapidly escape high-dimensional saddle points, unlike gradient descent and quasi-Newton methods; it applies this algorithm to deep and recurrent neural network training and provides numerical evidence for its superior optimization performance.
Deep Residual Learning for Image Recognition
This work presents a residual learning framework to ease the training of networks that are substantially deeper than those used previously, and provides comprehensive empirical evidence showing that these residual networks are easier to optimize and can gain accuracy from considerably increased depth.
FireCaffe: Near-Linear Acceleration of Deep Neural Network Training on Compute Clusters
FireCaffe is presented, which successfully scales deep neural network training across a cluster of GPUs, and finds that reduction trees are more efficient and scalable than the traditional parameter server approach.
ImageNet classification with deep convolutional neural networks
A large, deep convolutional neural network was trained to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes, employing a recently developed regularization method called "dropout" that proved to be very effective.
Rectified Linear Units Improve Restricted Boltzmann Machines
Restricted Boltzmann machines, originally developed with binary stochastic hidden units, are shown to learn features that are better for object recognition on the NORB dataset and face verification on the Labeled Faces in the Wild dataset when the hidden units are replaced with rectified linear units.
Scaling Distributed Machine Learning with the Parameter Server
Views on newly identified challenges are shared, and application scenarios such as micro-blog data analysis and data processing for building next-generation search engines are covered.
Deep Speech 2: End-to-End Speech Recognition in English and Mandarin
It is shown that an end-to-end deep learning approach can be used to recognize either English or Mandarin Chinese speech, two vastly different languages, and is competitive with the transcription of human workers when benchmarked on standard datasets.