Learn More
We introduce DropConnect, a generalization of Dropout (Hinton et al., 2012), for regularizing large fully-connected layers within neural networks. When training with Dropout, a randomly selected subset of activations is set to zero within each layer. DropConnect instead sets a randomly selected subset of weights within the network to zero. Each unit thus receives input from a random subset of units in the previous layer. …
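As a rough illustration of the difference, here is a minimal NumPy sketch (not the authors' implementation) contrasting a Dropout layer, which masks activations, with a DropConnect layer, which masks weights. For brevity a single mask is shared across the mini-batch, whereas DropConnect as described samples a mask per example.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_layer(x, W, p=0.5):
    # Dropout: zero a random subset of *activations* of the layer.
    a = x @ W
    mask = rng.random(a.shape) > p              # keep each activation with prob 1 - p
    return (a * mask) / (1.0 - p)               # inverted-dropout rescaling

def dropconnect_layer(x, W, p=0.5):
    # DropConnect: zero a random subset of *weights* instead.
    # (One mask shared across the mini-batch here for simplicity.)
    mask = rng.random(W.shape) > p              # keep each weight with prob 1 - p
    return x @ (W * mask) / (1.0 - p)

# Toy usage: a batch of 4 inputs through an 8 -> 3 fully connected layer.
x = rng.standard_normal((4, 8))
W = rng.standard_normal((8, 3))
print(dropout_layer(x, W).shape, dropconnect_layer(x, W).shape)
```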
The performance of stochastic gradient descent (SGD) depends critically on how learning rates are tuned and decreased over time. We propose a method to automatically adjust multiple learning rates so as to minimize the expected error at any one time. The method relies on local gradient variations across samples. In our approach, learning rates can increase as well as decrease …
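The abstract does not spell out the update rule here, but the idea of deriving per-parameter learning rates from local gradient variations can be sketched as follows. The estimator names (gbar, vbar), the fixed decay, and the constant curvature estimate h are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

class AdaptiveLR:
    """Per-parameter learning rates from local gradient statistics (sketch).

    Keeps running averages of the gradient (gbar) and squared gradient (vbar)
    and sets eta ~ gbar**2 / (h * vbar): the rate shrinks when gradients
    disagree across samples and grows when they agree.
    """

    def __init__(self, dim, h_est=1.0, decay=0.9):
        self.gbar = np.zeros(dim)            # running mean of gradients
        self.vbar = np.ones(dim)             # running mean of squared gradients
        self.h = np.full(dim, h_est)         # assumed local curvature estimate
        self.decay = decay

    def step(self, theta, grad):
        self.gbar = self.decay * self.gbar + (1 - self.decay) * grad
        self.vbar = self.decay * self.vbar + (1 - self.decay) * grad**2
        eta = self.gbar**2 / (self.h * self.vbar + 1e-12)   # can increase or decrease
        return theta - eta * grad

# Toy usage: minimize 0.5 * ||theta||^2 from noisy gradient samples.
rng = np.random.default_rng(0)
opt = AdaptiveLR(dim=3)
theta = rng.standard_normal(3)
for _ in range(500):
    grad = theta + 0.1 * rng.standard_normal(3)   # noisy gradient of the toy loss
    theta = opt.step(theta, grad)
print(theta)   # driven toward the optimum at 0
```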
We study the problem of how to distribute the training of large-scale deep learning models in the parallel computing environment. We propose a new distributed stochastic optimization method called Elastic Averaging SGD (EASGD). We analyze the convergence rate of the EASGD method in the synchronous scenario and compare its stability condition with the …
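A minimal sketch of one elastic-averaging round, assuming the synchronous form in which each worker takes a local SGD step plus a pull toward a center variable and the center moves toward the workers' average; the coupling constant alpha = eta * rho and the toy quadratic objective are illustrative choices, not the paper's experimental setup.

```python
import numpy as np

def easgd_round(workers, center, grads, eta=0.1, rho=0.5):
    """One synchronous elastic-averaging round (sketch).

    Each worker takes a local SGD step plus an elastic pull toward the
    center variable; the center moves toward the average of the workers.
    alpha = eta * rho is the elastic coupling strength.
    """
    alpha = eta * rho
    new_workers = [x - eta * g - alpha * (x - center)
                   for x, g in zip(workers, grads)]
    new_center = center + alpha * sum(x - center for x in workers)
    return new_workers, new_center

# Toy usage: p workers minimizing f(x) = 0.5 * ||x||^2 (gradient is x).
rng = np.random.default_rng(0)
p, dim = 4, 3
workers = [rng.standard_normal(dim) for _ in range(p)]
center = np.zeros(dim)
for _ in range(200):
    grads = [x.copy() for x in workers]       # gradient of the toy quadratic
    workers, center = easgd_round(workers, center, grads)
print(center)   # close to the optimum at 0
```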
If we do gradient descent with $\eta^*(t)$, then almost surely the algorithm converges (for the quadratic model). To prove that, we follow classical techniques based on Lyapunov stability theory (Bucy, 1965). Notice that the expected loss follows
$$
\mathbb{E}\left[ J\big(\theta^{(t+1)}\big) \,\middle|\, \theta^{(t)} \right]
= \tfrac{1}{2} h \cdot \mathbb{E}\left[ \big( (1 - \eta^* h)(\theta^{(t)} - \theta^*) + \eta^* h \sigma \xi \big)^2 + \sigma^2 \right]
= \tfrac{1}{2} h \left[ (1 - \eta^* h)^2 (\theta^{(t)} - \theta^*)^2 + \cdots \right]
$$
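Assuming, as the notation suggests, that $\xi$ is zero-mean, unit-variance noise independent of $\theta^{(t)}$, the truncated expansion would continue by squaring out the bracket; the cross term vanishes in expectation, leaving
$$
\mathbb{E}\left[ J\big(\theta^{(t+1)}\big) \,\middle|\, \theta^{(t)} \right]
= \tfrac{1}{2} h \left[ (1 - \eta^* h)^2 (\theta^{(t)} - \theta^*)^2 + (\eta^* h \sigma)^2 + \sigma^2 \right].
$$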
Recent developments in the field of deep learning have shown that convolutional networks with several layers can approach human-level accuracy in tasks such as handwritten digit classification and object recognition. It is observed that state-of-the-art performance is obtained from model ensembles, where several models are trained on the same data and …
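For concreteness, a minimal sketch of test-time ensembling of the kind referred to here: the class distributions predicted by several independently trained models are averaged before taking the arg-max. The function and toy data below are illustrative, not taken from the paper.

```python
import numpy as np

def ensemble_predict(prob_list):
    """Average the class distributions predicted by several models,
    then pick the class with the highest averaged probability."""
    avg = np.mean(np.stack(prob_list, axis=0), axis=0)   # (n_examples, n_classes)
    return avg.argmax(axis=1)

# Toy usage: 3 models, 5 examples, 10 classes (random stand-in predictions).
rng = np.random.default_rng(0)
fake_probs = [rng.dirichlet(np.ones(10), size=5) for _ in range(3)]
print(ensemble_predict(fake_probs))
```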