• Corpus ID: 61153617

Extreme Tensoring for Low-Memory Preconditioning

@article{Chen2020ExtremeTF,
  title={Extreme Tensoring for Low-Memory Preconditioning},
  author={Xinyi Chen and Naman Agarwal and Elad Hazan and Cyril Zhang and Yi Zhang},
  journal={ArXiv},
  year={2020},
  volume={abs/1902.04620}
}
State-of-the-art models are now trained with billions of parameters, reaching hardware limits in terms of memory consumption. This has created a recent demand for memory-efficient optimizers. To this end, we investigate the limits and performance tradeoffs of memory-efficient adaptively preconditioned gradient methods. We propose extreme tensoring for high-dimensional stochastic optimization, showing that an optimizer needs very little memory to benefit from adaptive preconditioning. Our… 


Stochastic Optimization with Laggard Data Pipelines
TLDR
It is shown that in convex optimization with stochastic minibatches, data echoing affords speedups on the curvature-dominated part of the convergence rate, while maintaining the optimal statistical rate.
APOLLO: AN ADAPTIVE PARAMETER-WISE DIAG-
  • Computer Science
  • 2020
TLDR
APOLLO is a quasi-Newton method for nonconvex stochastic optimization that dynamically incorporates the curvature of the loss function by approximating the Hessian via a diagonal matrix, making it as efficient as adaptive first-order optimization methods, with linear complexity in both time and memory.
Disentangling Adaptive Gradient Methods from Learning Rates
TLDR
A "grafting" experiment is introduced which decouples an update's magnitude from its direction, finding that many existing beliefs in the literature may have arisen from insufficient isolation of the implicit schedule of step sizes.
Apollo: An Adaptive Parameter-wise Diagonal Quasi-Newton Method for Nonconvex Stochastic Optimization
TLDR
Apollo, a quasi-Newton method for nonconvex stochastic optimization, which dynamically incorporates the curvature of the loss function by approximating the Hessian via a diagonal matrix, is introduced.
Adaptive Gradient Methods with Local Guarantees
TLDR
This paper proposes an adaptive gradient method that has provable adaptive regret guarantees vs. the best local preconditioner, and proves a new adaptive regret bound in online learning that improves upon previous adaptive online learning methods.
Better Full-Matrix Regret via Parameter-Free Online Learning
TLDR
This work provides online convex optimization algorithms that guarantee improved full-matrix regret bounds, and improves the regret analysis of the full-matrix AdaGrad algorithm by suggesting a better learning rate value and showing how to tune the learning rate to this value on the fly.
Adaptive Online Learning with Varying Norms
TLDR
A new examination of the full-matrix AdaGrad algorithm is provided, suggesting a better learning rate value that improves significantly upon prior analysis, and an improved bound is realized in a concrete algorithm.

References

SHOWING 1-10 OF 26 REFERENCES
Shampoo: Preconditioned Stochastic Tensor Optimization
TLDR
This work describes and analyzes a new structure-aware preconditioning algorithm, called Shampoo, for stochastic optimization over tensor spaces, and establishes convergence guarantees in the stochastic convex setting.
The Case for Full-Matrix Adaptive Regularization
TLDR
This work presents GGT, a truly scalable full-matrix adaptive optimizer that converges to first-order local minima, providing the first rigorous theoretical analysis of adaptive regularization in non-convex optimization.
Efficient Full-Matrix Adaptive Regularization
TLDR
The preliminary experiments show improved iteration-wise convergence rates across synthetic tasks and standard deep learning benchmarks, and that the more carefully-preconditioned steps sometimes lead to a better solution.
Adafactor: Adaptive Learning Rates with Sublinear Memory Cost
TLDR
This work demonstrates empirically that adaptive methods can produce larger-than-desired updates when the decay rate of the second moment accumulator is too slow, and proposes update clipping and a gradually increasing decay rate scheme as remedies.
Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer
TLDR
This work introduces a Sparsely-Gated Mixture-of-Experts layer (MoE), consisting of up to thousands of feed-forward sub-networks, and applies the MoE to the tasks of language modeling and machine translation, where model capacity is critical for absorbing the vast quantities of knowledge available in the training corpora.
signSGD: compressed optimisation for non-convex problems
TLDR
signSGD can get the best of both worlds, compressed gradients and an SGD-level convergence rate, and its momentum counterpart is able to match the accuracy and convergence speed of Adam on deep ImageNet models.
Kronecker-factored Curvature Approximations for Recurrent Neural Networks
TLDR
This work extends the K-FAC method to handle RNNs by introducing a novel approximation to the Fisher information matrix (FIM) for RNNs, and demonstrates that this method significantly outperforms general-purpose state-of-the-art optimizers like SGD with momentum and Adam on several challenging RNN training tasks.
Adam: A Method for Stochastic Optimization
TLDR
This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
Optimizing Neural Networks with Kronecker-factored Approximate Curvature
TLDR
K-FAC is an efficient method for approximating natural gradient descent in neural networks. It is based on an efficiently invertible approximation of a neural network's Fisher information matrix that is neither diagonal nor low-rank, and in some cases is completely non-sparse.
Adaptive Subgradient Methods for Online Learning and Stochastic Optimization
TLDR
This work describes and analyzes an apparatus for adaptively modifying the proximal function, which significantly simplifies setting a learning rate and results in regret guarantees that are provably as good as the best proximal function that can be chosen in hindsight.