Corpus ID: 244773098

Adaptive Optimization with Examplewise Gradients

@article{Kunze2021AdaptiveOW,
  title={Adaptive Optimization with Examplewise Gradients},
  author={Julius Kunze and James Townsend and David Barber},
  journal={ArXiv},
  year={2021},
  volume={abs/2112.00174}
}
We propose a new, more general approach to the design of stochastic gradient-based optimization methods for machine learning. In this new framework, optimizers assume access to a batch of gradient estimates per iteration, rather than a single estimate. This better reflects the information that is actually available in typical machine learning setups. To demonstrate the usefulness of this generalized approach, we develop Eve, an adaptation of the Adam optimizer which uses examplewise gradients… 
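The Eve optimizer itself is not reproduced on this page. As an illustration only, the following Python/JAX sketch (the names loss_fn and examplewise_grads are placeholders, not from the paper) shows how a batch of examplewise gradients can be obtained and what extra statistics, such as the per-parameter gradient variance, they expose to an optimizer beyond the single averaged gradient.

import jax
import jax.numpy as jnp

def loss_fn(params, x, y):
    # Toy squared-error loss for a single example.
    pred = jnp.dot(x, params)
    return (pred - y) ** 2

# vmap over grad yields one gradient per example rather than a single batch average.
examplewise_grads = jax.vmap(jax.grad(loss_fn), in_axes=(None, 0, 0))

params = jnp.zeros(3)
xs = jnp.ones((8, 3))
ys = jnp.ones(8)
grads = examplewise_grads(params, xs, ys)  # shape (8, 3): one gradient per example
g_mean = grads.mean(axis=0)                # what a conventional optimizer would see
g_var = grads.var(axis=0)                  # extra information an examplewise optimizer can exploit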

References

Showing 1-10 of 20 references

Adam: A Method for Stochastic Optimization

TLDR
This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions based on adaptive estimates of lower-order moments, and provides a regret bound whose convergence rate is comparable to the best known results under the online convex optimization framework.
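For reference, a minimal Python sketch of the Adam update described above (the function name adam_step is illustrative, not the authors' code; the defaults mirror the paper's recommended hyperparameters):

import numpy as np

def adam_step(theta, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # Exponential moving averages of the gradient (first moment) and its square (second moment).
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)  # bias correction; t starts at 1
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v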

On the Variance of the Adaptive Learning Rate and Beyond

TLDR
This work identifies a problem with the adaptive learning rate, suggests that warmup works as a variance reduction technique, and proposes RAdam, a new variant of Adam, which introduces a term to rectify the variance of the adaptive learning rate.
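A sketch of that rectification term, as commonly stated for RAdam (notation paraphrased here, not taken from this page; the exact thresholds should be checked against the paper): with \rho_\infty = 2/(1-\beta_2) - 1 and \rho_t = \rho_\infty - 2 t \beta_2^t/(1-\beta_2^t), the adaptive step \hat m_t/\sqrt{\hat v_t} is multiplied by

r_t = \sqrt{\frac{(\rho_t - 4)(\rho_t - 2)\,\rho_\infty}{(\rho_\infty - 4)(\rho_\infty - 2)\,\rho_t}}

when \rho_t is large enough for the variance to be tractable, and a plain momentum update is used otherwise.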

Dissecting Adam: The Sign, Magnitude and Variance of Stochastic Gradients

TLDR
This analysis extends recent results on the adverse effects of ADAM on generalization, isolating the sign aspect as the problematic one; transferring the variance adaptation to SGD gives rise to a novel method, completing the practitioner's toolbox for problems where ADAM fails.
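The decomposition referred to here can be written as a simple identity on Adam's update direction,

\frac{\hat m_t}{\sqrt{\hat v_t}} = \operatorname{sign}(\hat m_t)\cdot\frac{|\hat m_t|}{\sqrt{\hat v_t}},

separating the sign of the smoothed gradient from a magnitude that is damped where the second-moment estimate, and hence the gradient variance, is large.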

ADADELTA: An Adaptive Learning Rate Method

We present a novel per-dimension learning rate method for gradient descent called ADADELTA. The method dynamically adapts over time using only first order information and has minimal computational overhead beyond vanilla stochastic gradient descent.
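The per-dimension rule is, roughly (with decay rate \rho, small constant \epsilon, and \mathrm{RMS}[z]_t = \sqrt{E[z^2]_t + \epsilon}):

E[g^2]_t = \rho\, E[g^2]_{t-1} + (1-\rho)\, g_t^2
\Delta x_t = -\frac{\mathrm{RMS}[\Delta x]_{t-1}}{\mathrm{RMS}[g]_t}\, g_t
E[\Delta x^2]_t = \rho\, E[\Delta x^2]_{t-1} + (1-\rho)\, \Delta x_t^2, \qquad x_{t+1} = x_t + \Delta x_t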

Coupling Adaptive Batch Sizes with Learning Rates

TLDR
This work proposes a practical method for dynamic batch size adaptation that estimates the variance of the stochastic gradients and adapts the batch size so that this variance decreases in proportion to the value of the objective function, removing the need for the otherwise customary learning rate decrease.
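The paper's exact adaptation rule and constants are not reproduced here; the following Python sketch (suggest_batch_size is a hypothetical helper) only illustrates the quantities the summary mentions, namely an empirical gradient-variance estimate and a batch size that grows with variance relative to the objective value.

import numpy as np

def suggest_batch_size(per_example_grads, objective_value, scale=1.0, b_min=16, b_max=4096):
    # Total gradient variance, estimated from the per-example gradients of the current batch.
    var = per_example_grads.var(axis=0).sum()
    # Larger variance (or smaller objective) suggests a larger batch; 'scale' is a tuning constant.
    b = scale * var / max(objective_value, 1e-12)
    return int(np.clip(b, b_min, b_max))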

BackPACK: Packing more into backprop

TLDR
BackPACK is introduced: an efficient framework built on top of PyTorch that extends the backpropagation algorithm to extract additional information from first- and second-order derivatives, addressing the fact that automatic differentiation frameworks do not natively support quantities such as the variance of the mini-batch gradients.
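A minimal usage sketch with BackPACK's BatchGrad extension (API names as documented by the BackPACK project; treat the details as approximate):

import torch
from torch import nn
from backpack import backpack, extend
from backpack.extensions import BatchGrad

model = extend(nn.Linear(10, 2))          # wrap modules so BackPACK can hook into them
lossfunc = extend(nn.CrossEntropyLoss())

X = torch.randn(8, 10)
y = torch.randint(0, 2, (8,))

loss = lossfunc(model(X), y)
with backpack(BatchGrad()):               # request per-example gradients during backprop
    loss.backward()

for p in model.parameters():
    print(p.grad.shape, p.grad_batch.shape)  # p.grad_batch has shape (batch_size, *p.shape)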

An Alternative View: When Does SGD Escape Local Minima?

TLDR
SGD will not get stuck at "sharp" local minima with small diameters, as long as the neighborhoods of these regions contain enough gradient information, which helps explain why SGD works so well for neural networks.

Proximal Policy Optimization Algorithms

We propose a new family of policy gradient methods for reinforcement learning, which alternate between sampling data through interaction with the environment and optimizing a "surrogate" objective function using stochastic gradient ascent.
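The surrogate in question is PPO's clipped objective (notation paraphrased):

L^{\mathrm{CLIP}}(\theta) = \hat{\mathbb{E}}_t\left[\min\big(r_t(\theta)\,\hat A_t,\ \operatorname{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\,\hat A_t\big)\right], \qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)},

where \hat A_t is an advantage estimate and \epsilon a small clipping parameter.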

Attention is All you Need

TLDR
A new, simple network architecture is proposed: the Transformer, based solely on attention mechanisms and dispensing with recurrence and convolutions entirely; it generalizes well to other tasks, as shown by its successful application to English constituency parsing with both large and limited training data.
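The core operation of that architecture is scaled dot-product attention,

\operatorname{Attention}(Q, K, V) = \operatorname{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V,

where Q, K, V are the query, key, and value matrices and d_k is the key dimension.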

Deep Residual Learning for Image Recognition

TLDR
This work presents a residual learning framework to ease the training of networks that are substantially deeper than those used previously, and provides comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth.
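In that framework, stacked layers learn a residual function rather than a direct mapping; a building block computes

y = \mathcal{F}(x, \{W_i\}) + x,

so the identity shortcut carries x forward and \mathcal{F} only has to model the residual.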