Corpus ID: 235825914

Learn2Hop: Learned Optimization on Rough Landscapes

Amil Merchant, Luke Metz, Samuel S. Schoenholz, Ekin Dogus Cubuk
Optimization of non-convex loss surfaces containing many local minima remains a critical problem in a variety of domains, including operations research, informatics, and material design. Yet, current techniques either require extremely high iteration counts or a large number of random restarts for good performance. In this work, we propose adapting recent developments in meta-learning to these many-minima problems by learning the optimization algorithm for various loss landscapes. We focus on… 
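The "Hop" in the title nods to basin hopping, the classical workhorse for many-minima problems and the kind of baseline the paper's learned optimizers are measured against. As context, here is a minimal sketch of plain basin hopping on a toy 1-D landscape (this is the handcrafted baseline, not the paper's learned method; the landscape, hop size, step count, and temperature are illustrative assumptions):

```python
import math
import random

def basin_hop(f, grad, x0, steps=50, hop=1.0, lr=0.01, T=0.5,
              rng=random.Random(0)):
    """Classic basin hopping: descend to a local minimum, then 'hop'
    by a random perturbation and accept the new basin with a
    Metropolis criterion. Returns the best minimum found."""
    def local_min(x):
        # plain gradient descent as the inner local optimizer
        for _ in range(200):
            x -= lr * grad(x)
        return x

    x = local_min(x0)
    best = x
    for _ in range(steps):
        cand = local_min(x + rng.gauss(0.0, hop))
        # accept downhill moves always, uphill moves with Boltzmann prob.
        if f(cand) < f(x) or rng.random() < math.exp((f(x) - f(cand)) / T):
            x = cand
        if f(x) < f(best):
            best = x
    return best

# Rugged 1-D landscape: a quadratic bowl with sinusoidal ripples,
# so gradient descent alone gets stuck in whichever basin it starts in.
f = lambda x: x * x + 2.0 * math.sin(5.0 * x) + 2.0
g = lambda x: 2.0 * x + 10.0 * math.cos(5.0 * x)

x_star = basin_hop(f, g, x0=4.0)
```

Starting from x0 = 4.0, plain gradient descent would stop in a high-energy ripple; the hop-and-accept loop lets the walk escape basins and settle near the low-energy region around the origin.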

Figures and Tables from this paper

StriderNET: A Graph Reinforcement Learning Approach to Optimize Atomic Structures on Rough Energy Landscapes

StriderNET presents a promising framework that enables the optimization of atomic structures on a rough landscape and outperforms classical optimization algorithms such as gradient descent, FIRE, and Adam.

Discovering Evolution Strategies via Meta-Black-Box Optimization

This work proposes to discover effective update rules for evolution strategies via meta-learning, and employs a search strategy parametrized by a self-attention-based architecture, which guarantees the update rule is invariant to the ordering of the candidate solutions.
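The update rules this work meta-learns generalize hand-designed evolution strategies. For reference, a minimal hand-designed ES step (the scalar objective, population size, step sizes, and baseline subtraction are illustrative choices, not the discovered rule):

```python
import random

def es_step(theta, fitness, lr=0.05, sigma=0.1, pop=50,
            rng=random.Random(0)):
    """One isotropic evolution-strategies step (maximization):
    sample Gaussian perturbations, estimate the search gradient from
    fitness-weighted noise, and move theta along that estimate."""
    eps = [rng.gauss(0.0, 1.0) for _ in range(pop)]
    f = [fitness(theta + sigma * e) for e in eps]
    baseline = sum(f) / pop  # subtracting the mean fitness reduces variance
    grad = sum((fi - baseline) * ei
               for fi, ei in zip(f, eps)) / (pop * sigma)
    return theta + lr * grad

# Maximize -(x - 1)^2, i.e. find x = 1, without ever using a gradient
theta = 3.0
for _ in range(300):
    theta = es_step(theta, lambda x: -(x - 1.0) ** 2)
```

Note the estimator only queries fitness values, never derivatives, which is what makes ES applicable to black-box problems.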

Transformer-Based Learned Optimization

The main innovation is to propose a new neural network architecture for the learned optimizer inspired by the classic BFGS algorithm that allows for conditioning across different dimensions of the parameter space of the target problem while remaining applicable to optimization tasks of variable dimensionality without re-training.

VeLO: Training Versatile Learned Optimizers by Scaling Up

An optimizer for deep learning is trained which is itself a small neural network that ingests gradients and outputs parameter updates, and requires no hyperparameter tuning, instead automatically adapting to the specifics of the problem being optimized.
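The interface described above, a small network that maps per-parameter gradient features to parameter updates, can be sketched in miniature. This is an illustrative sketch of the general learned-optimizer interface, not VeLO's actual architecture; the feature set, hidden size, and direction-times-magnitude output head are assumptions, and the random weights stand in for meta-trained ones:

```python
import math
import random

rng = random.Random(0)

# Tiny per-parameter optimizer network: 2 input features
# (gradient, momentum) -> 4 hidden units -> 2 outputs.
# In a real system W1, b1, W2, b2 would be meta-trained.
HIDDEN = 4
W1 = [[rng.gauss(0.0, 0.5) for _ in range(2)] for _ in range(HIDDEN)]
b1 = [0.0] * HIDDEN
W2 = [[rng.gauss(0.0, 0.5) for _ in range(HIDDEN)] for _ in range(2)]
b2 = [0.0, 0.0]

def learned_update(grad, momentum):
    """Map per-parameter features to a parameter update: the network
    outputs a direction and a log-magnitude, a common parameterization
    in the learned-optimizer literature."""
    h = [math.tanh(sum(w * x for w, x in zip(row, (grad, momentum))) + b)
         for row, b in zip(W1, b1)]
    direction, log_mag = [sum(w * x for w, x in zip(row, h)) + b
                          for row, b in zip(W2, b2)]
    # exp-scaled magnitude keeps the step size positive and lets the
    # network span many orders of magnitude
    return direction * 0.001 * math.exp(log_mag)

update = learned_update(0.5, 0.1)
```

The point of the parameterization is that the same small network is applied independently to every parameter, so the optimizer's size is fixed regardless of the target model's size.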

Tutorial on amortized optimization for learning to optimize over continuous domains

This tutorial discusses the key design choices behind amortized optimization, roughly categorizing models into fully-amortized and semi-amortized approaches, and learning methods into regression-based and objective-based approaches.

Learning to Generalize Provably in Learning to Optimize

This paper theoretically establishes an implicit connection between the local entropy and the Hessian, unifying their roles in the handcrafted design of generalizable optimizers as equivalent metrics of the flatness of the loss landscape, and proposes to incorporate these two metrics as flatness-aware regularizers into the L2O framework in order to meta-train optimizers that learn to generalize.

Learning to Optimize: A Primer and A Benchmark

This article is poised to be the first comprehensive survey and benchmark of L2O for continuous optimization: it sets up taxonomies, categorizes existing works and research directions, presents insights, and identifies open challenges.

evosax: JAX-based Evolution Strategies

The deep learning revolution has been greatly accelerated by the 'hardware lottery': recent advances in modern hardware accelerators and compilers paved the way for large-scale batch gradient…

Distributional Reinforcement Learning for Scheduling of Chemical Production Processes

Reinforcement Learning (RL) has recently received significant attention from the process systems engineering and control communities. Recent works have investigated the application of RL to identify…



Reverse engineering learned optimizers reveals known and novel mechanisms

This work studies learned optimizers trained from scratch on three disparate tasks, and discovers that they have learned interpretable mechanisms, including: momentum, gradient clipping, learning rate schedules, and a new form of learning rate adaptation.
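The mechanisms this paper recovers from trained optimizers are all expressible as a single handcrafted update. A minimal sketch combining them (the specific 1/(1 + decay·t) schedule, heavy-ball momentum form, and hyperparameter values are illustrative choices, not the recovered optimizer):

```python
def mechanisms_step(theta, vel, grad, t,
                    lr0=0.1, decay=0.01, mu=0.9, clip=1.0):
    """One update combining gradient clipping, heavy-ball momentum,
    and a decaying learning-rate schedule."""
    g = max(-clip, min(clip, grad))   # gradient clipping
    vel = mu * vel + g                # momentum accumulator
    lr = lr0 / (1.0 + decay * t)      # learning-rate schedule
    return theta - lr * vel, vel

# Minimize f(x) = x^2 (gradient 2x) from x = 2
x, v = 2.0, 0.0
for t in range(300):
    x, v = mechanisms_step(x, v, 2.0 * x, t)
```

Clipping bounds the step early on, momentum smooths the descent, and the decaying schedule damps the residual oscillation near the minimum.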

Learned Optimizers that Scale and Generalize

This work introduces a learned gradient descent optimizer that generalizes well to new tasks, and which has significantly reduced memory and computation overhead, by introducing a novel hierarchical RNN architecture with minimal per-parameter overhead.

Tasks, stability, architecture, and compute: Training more effective learned optimizers, and using them to train themselves

This work introduces a new, neural-network-parameterized, hierarchical optimizer with access to additional features such as validation loss to enable automatic regularization, and shows evidence that it is useful for out-of-distribution tasks such as training itself from scratch.

Evolution of the Potential Energy Surface with Size for Lennard-Jones Clusters

Disconnectivity graphs are used to characterize the potential energy surfaces of Lennard-Jones clusters containing 13, 19, 31, 38, 55, and 75 atoms. This set includes members which exhibit either one…
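The Lennard-Jones energy these surfaces describe is a simple pairwise sum, which makes the clusters a standard rough-landscape benchmark. A minimal sketch of the total energy (reduced units, with eps and sigma set to 1):

```python
import itertools

def lj_energy(positions, eps=1.0, sigma=1.0):
    """Total Lennard-Jones energy of a cluster: the sum over atom
    pairs of 4*eps*((sigma/r)^12 - (sigma/r)^6)."""
    e = 0.0
    for (x1, y1, z1), (x2, y2, z2) in itertools.combinations(positions, 2):
        r2 = (x1 - x2) ** 2 + (y1 - y2) ** 2 + (z1 - z2) ** 2
        sr6 = (sigma * sigma / r2) ** 3   # (sigma/r)^6 without a sqrt
        e += 4.0 * eps * (sr6 * sr6 - sr6)
    return e

# Sanity check: a dimer at the pair-equilibrium separation
# r = 2^(1/6) * sigma has energy -eps.
dimer = [(0.0, 0.0, 0.0), (2.0 ** (1.0 / 6.0), 0.0, 0.0)]
```

The number of local minima of this sum grows roughly exponentially with atom count, which is why cluster sizes like 38 and 75 (whose global minima sit in narrow funnels) are hard tests for optimizers.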

Learning Gradient Descent: Better Generalization and Longer Horizons

This paper proposes a new learning-to-learn model and some useful and practical tricks, and demonstrates the effectiveness of the algorithms on a number of tasks, including deep MLPs, CNNs, and simple LSTMs.

Adam: A Method for Stochastic Optimization

This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
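The adaptive moment estimates described above fit in a few lines. A minimal scalar sketch of the update, using the paper's notation for the first- and second-moment estimates m and v (the demo objective and learning rate are illustrative choices):

```python
def adam_step(theta, grad, m, v, t,
              lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step on a scalar parameter; m and v are the biased
    first- and second-moment estimates, bias-corrected by the
    (1 - beta^t) factors. t is 1-indexed."""
    m = beta1 * m + (1 - beta1) * grad          # first moment (mean)
    v = beta2 * v + (1 - beta2) * grad * grad   # second moment (uncentered)
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    return theta - lr * m_hat / (v_hat ** 0.5 + eps), m, v

# Minimize f(x) = x^2 (gradient 2x) from x = 1
x, m, v = 1.0, 0.0, 0.0
for t in range(1, 501):
    x, m, v = adam_step(x, 2.0 * x, m, v, t, lr=0.01)
```

Because the step is m_hat / sqrt(v_hat), its magnitude is roughly invariant to the scale of the gradients, which is the property that makes Adam robust across problems.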

Understanding and correcting pathologies in the training of learned optimizers

This work proposes a training scheme which overcomes both of these difficulties, by dynamically weighting two unbiased gradient estimators for a variational loss on optimizer performance, allowing us to train neural networks to perform optimization of a specific task faster than tuned first-order methods.

Neural Message Passing for Quantum Chemistry

Using MPNNs, state of the art results on an important molecular property prediction benchmark are demonstrated and it is believed future work should focus on datasets with larger molecules or more accurate ground truth labels.
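The message-passing scheme underlying MPNNs can be shown in miniature. A sketch of one round, with deliberately trivial message and update functions (sum-of-neighbors messages and an averaging update are illustrative stand-ins for the learned functions in a real MPNN):

```python
def mpnn_step(h, edges):
    """One message-passing round: each node aggregates its neighbors'
    feature vectors by summation (the messages), then updates its own
    state, here as a simple average of old state and message."""
    msgs = [[0.0] * len(h[0]) for _ in h]
    for u, v in edges:                 # undirected edges: send both ways
        for k in range(len(h[0])):
            msgs[u][k] += h[v][k]
            msgs[v][k] += h[u][k]
    return [[0.5 * (hi + mi) for hi, mi in zip(hv, mv)]
            for hv, mv in zip(h, msgs)]

# Triangle graph with scalar node features
h = [[1.0], [2.0], [3.0]]
edges = [(0, 1), (1, 2), (0, 2)]
h1 = mpnn_step(h, edges)   # -> [[3.0], [3.0], [3.0]]
```

On the fully connected triangle, one round already mixes every node's information into every other node's state; on a molecule, the number of rounds bounds how far along bonds information can travel.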

Learned optimizers that outperform SGD on wall-clock and test loss

This work proposes a training scheme which overcomes both of these difficulties, by dynamically weighting two unbiased gradient estimators for a variational loss on optimizer performance, and is able to learn optimizers that train networks to better generalization than first order methods.

The Loss Surfaces of Multilayer Networks

It is proved that recovering the global minimum becomes harder as the network size increases, and that this is irrelevant in practice, as the global minimum often leads to overfitting.