• Corpus ID: 3455897

On the Convergence of Adam and Beyond
Published as a conference paper at ICLR 2018

Several recently proposed stochastic optimization methods that have been successfully used in training deep networks, such as RMSProp, Adam, Adadelta, and Nadam, are based on using gradient updates scaled by square roots of exponential moving averages of squared past gradients. In many applications, e.g. learning with large output spaces, it has been empirically observed that these algorithms fail to converge to an optimal solution (or a critical point in nonconvex settings). We show that one cause…
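The update rule the abstract describes, a gradient step scaled by the square root of an exponential moving average of squared past gradients, can be sketched roughly as follows. This is a minimal RMSProp-style illustration on a scalar parameter, not the paper's own algorithm; the function names `ema_scaled_step` and `minimize_quadratic`, the hyperparameter defaults, and the toy objective f(x) = x^2 are all assumptions made for the example.

```python
import math


def ema_scaled_step(x, v, grad, lr=0.01, beta=0.9, eps=1e-8):
    """One RMSProp-style update on a scalar parameter (illustrative only).

    v is the exponential moving average of squared past gradients;
    the step is the gradient divided by the square root of that average.
    """
    v = beta * v + (1.0 - beta) * grad * grad
    x = x - lr * grad / (math.sqrt(v) + eps)
    return x, v


def minimize_quadratic(x0=1.0, steps=500):
    """Run the update on the toy objective f(x) = x^2 (an assumed example)."""
    x, v = x0, 0.0
    for _ in range(steps):
        grad = 2.0 * x  # derivative of x^2
        x, v = ema_scaled_step(x, v, grad)
    return x


if __name__ == "__main__":
    print(minimize_quadratic())
```

On this simple convex problem the iterate settles near the minimizer at 0. The non-convergence the abstract refers to concerns other problem instances and is not reproduced by this sketch.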
13 Citations

Dropout with Expectation-linear Regularization
This work first formulates dropout as a tractable approximation of a latent variable model, leading to a clean view of parameter sharing and enabling further theoretical analysis, and introduces (approximate) expectation-linear dropout neural networks, whose inference gap the authors are able to formally characterize.
Incremental Network Quantization: Towards Lossless CNNs with Low-Precision Weights
Extensive experiments on the ImageNet classification task using almost all known deep CNN architectures, including AlexNet, VGG-16, GoogleNet and ResNets, testify to the efficacy of the proposed INQ, showing that at 5-bit quantization, models achieve higher accuracy than the 32-bit floating-point references.
Super-Resolution with Deep Convolutional Sufficient Statistics
This paper proposes using a Gibbs distribution as the conditional model, with sufficient statistics given by deep convolutional neural networks; the features computed by the network are stable to local deformation and have reduced variance when the input is a stationary texture.
Energy-based Generative Adversarial Network
We introduce the "Energy-based Generative Adversarial Network" model (EBGAN) which views the discriminator as an energy function that attributes low energies to the regions near the data manifold and
Deep Variational Information Bottleneck
It is shown that models trained with the VIB objective outperform those that are trained with other forms of regularization, in terms of generalization performance and robustness to adversarial attack.
Paying More Attention to Attention: Improving the Performance of Convolutional Neural Networks via Attention Transfer
This work shows that, by properly defining attention for convolutional neural networks, this type of information can be used in order to significantly improve the performance of a student CNN network by forcing it to mimic the attention maps of a powerful teacher network.
Delving into Transferable Adversarial Examples and Black-box Attacks
This work is the first to conduct an extensive study of transferability over large models and a large-scale dataset, and it is also the first to study the transferability of targeted adversarial examples with their target labels.
Multi-Scale Context Aggregation by Dilated Convolutions
This work develops a new convolutional network module that is specifically designed for dense prediction, and shows that the presented context module increases the accuracy of state-of-the-art semantic segmentation systems.
Sparsely-Connected Neural Networks: Towards Efficient VLSI Implementation of Deep Neural Networks
Sparsely-connected neural networks are proposed, showing that the number of connections in fully-connected networks can be reduced by up to 90% while improving accuracy on three popular datasets; an efficient hardware architecture based on linear-feedback shift registers is also proposed to reduce the memory requirements of the sparsely-connected networks.
DeepCoder: Learning to Write Programs
The approach is to train a neural network to predict properties of the program that generated the outputs from the inputs to augment search techniques from the programming languages community, including enumerative search and an SMT-based solver.


Adam: A Method for Stochastic Optimization
This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
Adaptive Subgradient Methods for Online Learning and Stochastic Optimization
This work describes and analyzes an apparatus for adaptively modifying the proximal function, which significantly simplifies setting a learning rate and results in regret guarantees that are provably as good as the best proximal functions that can be chosen in hindsight.
On the generalization ability of on-line learning algorithms
This paper proves tight data-dependent bounds for the risk of this hypothesis in terms of an easily computable statistic M_n associated with the on-line performance of the ensemble, and obtains risk tail bounds for kernel perceptron algorithms in terms of the spectrum of the empirical kernel matrix.
Adaptive and Self-Confident On-Line Learning Algorithms
This paper shows that essentially the same optimized bounds can be obtained when the algorithms adaptively tune their learning rates as the examples in the sequence are progressively revealed, even though the optimal rates depend on the whole sequence of examples.
Dropout: a simple way to prevent neural networks from overfitting
It is shown that dropout improves the performance of neural networks on supervised learning tasks in vision, speech recognition, document classification and computational biology, obtaining state-of-the-art results on many benchmark data sets.
ImageNet classification with deep convolutional neural networks
A large, deep convolutional neural network was trained to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes and employed a recently developed regularization method called "dropout" that proved to be very effective.
Adaptive Bound Optimization for Online Convex Optimization
This work introduces a new online convex optimization algorithm that adaptively chooses its regularization function based on the loss functions observed so far, and proves competitive guarantees showing that the algorithm provides a bound within a constant factor of the best possible bound in hindsight.
Online Convex Programming and Generalized Infinitesimal Gradient Ascent
An algorithm for convex programming is introduced, and it is shown that it is really a generalization of infinitesimal gradient ascent, and the results here imply that generalized infinitesimal gradient ascent (GIGA) is universally consistent.
RmsProp: Divide the gradient by a running average of its recent magnitude
COURSERA: Neural Networks for Machine Learning, 2012
Incorporating Nesterov Momentum into Adam