Corpus ID: 222291172

Efficient Wasserstein Natural Gradients for Reinforcement Learning

Theodore H. Moskovitz, Michael Arbel, Ferenc Huszár, Arthur Gretton
A novel optimization approach is proposed for application to policy gradient methods and evolution strategies for reinforcement learning (RL). The procedure uses a computationally efficient Wasserstein natural gradient (WNG) descent that takes advantage of the geometry induced by a Wasserstein penalty to speed optimization. This method follows the recent theme in RL of including a divergence penalty in the objective to establish a trust region. Experiments on challenging tasks demonstrate… 
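The trust-region effect of such a Wasserstein penalty can be illustrated with a minimal sketch. For Gaussian policies with a shared, fixed covariance, the squared 2-Wasserstein distance between the distributions reduces to the squared Euclidean distance between their means, so the penalty pulls each update back toward the previous parameters. The quadratic loss, step sizes, and penalty weight below are illustrative, not the paper's actual method:

```python
import numpy as np

# Sketch: one gradient step on L(theta) + lam * W2^2(theta, theta_old).
# For fixed-covariance Gaussians, W2^2 reduces to ||theta - theta_old||^2,
# whose gradient 2 * lam * (theta - theta_old) pulls the step toward theta_old.
def penalized_step(grad_loss, theta, theta_old, lr=0.5, lam=1.0):
    grad = grad_loss(theta) + 2.0 * lam * (theta - theta_old)
    return theta - lr * grad

# Toy surrogate loss: quadratic bowl centred at [3, 3] (illustrative only).
grad_loss = lambda th: 2.0 * (th - np.array([3.0, 3.0]))

theta_old = np.zeros(2)
theta = np.ones(2)
plain = theta - 0.5 * grad_loss(theta)               # unpenalized step -> [3, 3]
trust = penalized_step(grad_loss, theta, theta_old)  # penalized step -> [2, 2]
print(plain, trust)  # the penalized step stays closer to theta_old
```

The penalized update moves less far from the previous iterate, which is the trust-region behaviour the divergence penalty is meant to induce; WNG additionally exploits the geometry of this penalty rather than just adding it to the loss.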

Figures from this paper

Citations

Towards an Understanding of Default Policies in Multitask Policy Optimization
This work formally links the quality of the default policy to its effect on optimization, and derives a principled RPO algorithm for multitask learning with strong performance guarantees.
A First-Occupancy Representation for Reinforcement Learning
The first-occupancy representation (FR) is introduced, which measures the expected temporal discount to the first time a state is accessed. The FR facilitates the selection of efficient paths to desired states, allows the agent, under certain conditions, to plan provably optimal trajectories defined by a sequence of subgoals, and induces behavior similar to that of animals avoiding threatening stimuli.
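A hedged tabular sketch of this idea: the FR for a fixed policy satisfies a recursion of the form F(s, s') = 1[s = s'] + (1 - 1[s = s']) · γ · F(next(s), s'), so it can be computed by fixed-point iteration. The deterministic three-state chain below is an illustrative toy, not an environment from the paper:

```python
import numpy as np

# Tabular first-occupancy representation F(s, s') = E[gamma^(first hit time of s')],
# computed by iterating its recursion for a deterministic policy on a small chain.
def first_occupancy(next_state, n_states, gamma=0.9, iters=200):
    F = np.zeros((n_states, n_states))
    for _ in range(iters):
        new = np.zeros_like(F)
        for s in range(n_states):
            for sp in range(n_states):
                if s == sp:
                    new[s, sp] = 1.0  # s' is occupied immediately
                else:
                    new[s, sp] = gamma * F[next_state[s], sp]
        F = new
    return F

# 3-state chain: 0 -> 1 -> 2 -> 2 (state 2 is absorbing).
F = first_occupancy(next_state=[1, 2, 2], n_states=3)
print(F[0, 2])  # gamma**2: state 2 is first reached after two steps
print(F[1, 0])  # 0: state 0 is never reached from state 1
```

Unlike the successor representation, which accumulates discount over every visit, the recursion above stops at the first occupancy, which is what makes it useful for shortest-path-like planning.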
Deep Reinforcement Learning with Dynamic Optimism
This work shows that the optimal degree of optimism can vary both across tasks and over the course of learning, and introduces a novel deep actor-critic algorithm, Dynamic Optimistic and Pessimistic Estimation (DOPE), to switch between optimistic and pessimistic value learning online by formulating the selection as a multi-arm bandit problem.
MICo: Improved representations via sampling-based state similarity for Markov decision processes
A new behavioural distance over the state space of a Markov decision process is presented, and empirical evidence that learning this distance alongside the value function yields structured and informative representations, including strong results on the Arcade Learning Environment benchmark.
Tactical Optimism and Pessimism for Deep Reinforcement Learning
This work shows that the most effective degree of optimism can vary both across tasks and over the course of learning, and introduces a novel deep actor-critic framework, Tactical Optimistic and Pessimistic (TOP) estimation, which switches between optimistic and pessimistic value learning online.

References

Behavior-Guided Reinforcement Learning
A new approach is proposed for comparing reinforcement learning policies, using Wasserstein distances in a newly defined latent behavioral space; the dual formulation of the WD is used to learn score functions over trajectories that can in turn be used to lead policy optimization towards (or away from) (un)desired behaviors.
A Natural Policy Gradient
This work provides a natural gradient method that represents the steepest descent direction based on the underlying structure of the parameter space and shows drastic performance improvements in simple MDPs and in the more challenging MDP of Tetris.
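The core idea can be sketched in a few lines: the natural gradient preconditions the ordinary gradient by the inverse Fisher information, θ ← θ − η F⁻¹∇L, making the update depend on the distribution the parameters induce rather than on the raw parameterization. A minimal one-dimensional example, fitting the mean of a Gaussian with fixed σ (the numbers are illustrative):

```python
import numpy as np

# For N(mu, sigma^2) with fixed sigma, the NLL gradient wrt mu for a sample x
# is (mu - x) / sigma^2 and the Fisher information wrt mu is 1 / sigma^2.
# The natural gradient F^{-1} * grad = (mu - x) is therefore invariant to sigma,
# while the vanilla gradient is rescaled by 1 / sigma^2.
def grads(mu, x, sigma):
    g = (mu - x) / sigma**2   # vanilla gradient of the negative log-likelihood
    fisher = 1.0 / sigma**2   # Fisher information for the mean parameter
    return g, g / fisher      # (vanilla gradient, natural gradient)

for sigma in (0.1, 10.0):
    g, ng = grads(mu=0.0, x=1.0, sigma=sigma)
    print(sigma, g, ng)  # vanilla grad varies wildly with sigma; natural grad is -1.0
```

This parameterization invariance is what the summary means by "based on the underlying structure of the parameter space": the step is the same sensible size regardless of how sharply or flatly the distribution is parameterized.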
Implicit Policy for Reinforcement Learning
It is empirically shown that, despite its simplicity of implementation, entropy regularization combined with a rich policy class can attain desirable properties displayed under the maximum entropy reinforcement learning framework, such as robustness and multi-modality.
Proximal Policy Optimization Algorithms
We propose a new family of policy gradient methods for reinforcement learning, which alternate between sampling data through interaction with the environment, and optimizing a "surrogate" objective function using stochastic gradient ascent.
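PPO's clipped surrogate is compact enough to sketch directly: per sample it is min(r·A, clip(r, 1−ε, 1+ε)·A), where r is the probability ratio π_θ(a|s)/π_θ_old(a|s) and A is an advantage estimate. The ratios and advantages below are illustrative values, not data from the paper:

```python
import numpy as np

# PPO clipped surrogate objective for a single (ratio, advantage) pair.
# Clipping removes the incentive to move the ratio far outside [1-eps, 1+eps],
# and the min makes the objective a pessimistic (lower) bound.
def clipped_surrogate(ratio, advantage, eps=0.2):
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return np.minimum(unclipped, clipped)

print(clipped_surrogate(1.5, 1.0))   # 1.2: large ratio clipped at 1 + eps
print(clipped_surrogate(1.1, 1.0))   # 1.1: within the clip range, untouched
print(clipped_surrogate(0.5, -1.0))  # -0.8: min picks the more pessimistic clipped term
```

This clipping is a cheap stand-in for an explicit divergence penalty or constraint, which is why PPO belongs to the same trust-region family as the penalty-based methods discussed above.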
Kernelized Wasserstein Natural Gradient
This work proposes a general framework to approximate the natural gradient for the Wasserstein metric by leveraging a dual formulation of the metric restricted to a Reproducing Kernel Hilbert Space, and leads to an estimator of the gradient direction that can trade off accuracy and computational cost, with theoretical guarantees.
Improving Exploration in Evolution Strategies for Deep Reinforcement Learning via a Population of Novelty-Seeking Agents
This paper shows that algorithms that have been invented to promote directed exploration in small-scale evolved neural networks via populations of exploring agents, specifically novelty search and quality diversity algorithms, can be hybridized with ES to improve its performance on sparse or deceptive deep RL tasks, while retaining scalability.
Trust Region Policy Optimization
A method for optimizing control policies with guaranteed monotonic improvement is described; making several approximations to the theoretically justified scheme yields a practical algorithm, called Trust Region Policy Optimization (TRPO).
Provably Robust Blackbox Optimization for Reinforcement Learning
This paper proposes a new class of algorithms, called Robust Blackbox Optimization (RBO), which relies on learning gradient flows using robust regression methods to enable off-policy updates and is able to train policies effectively on several MuJoCo robot control tasks.
Evolution Strategies as a Scalable Alternative to Reinforcement Learning
This work explores the use of Evolution Strategies (ES), a class of black box optimization algorithms, as an alternative to popular MDP-based RL techniques such as Q-learning and Policy Gradients, and highlights several advantages of ES as a blackbox optimization technique.
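The ES estimator at the heart of this approach admits a short sketch: perturb the parameters with Gaussian noise, evaluate the blackbox fitness at the perturbed points, and form the gradient estimate (1/(2nσ)) Σ [F(θ+σεᵢ) − F(θ−σεᵢ)] εᵢ, using antithetic pairs to reduce variance. The fitness function and hyperparameters below are illustrative toys:

```python
import numpy as np

# Antithetic ES gradient estimate: each noise vector is evaluated at both
# theta + sigma*eps and theta - sigma*eps, and the fitness difference weights eps.
def es_step(F, theta, rng, n=50, sigma=0.1, lr=0.05):
    eps = rng.standard_normal((n, theta.size))
    deltas = np.array([F(theta + sigma * e) - F(theta - sigma * e) for e in eps])
    grad = (eps.T @ deltas) / (2 * n * sigma)
    return theta + lr * grad  # gradient ascent on the fitness

# Toy blackbox fitness, maximized at theta = 3 in every coordinate.
fitness = lambda th: -np.sum((th - 3.0) ** 2)

rng = np.random.default_rng(0)
theta = np.zeros(5)
for _ in range(200):
    theta = es_step(fitness, theta, rng)
print(theta)  # close to the optimum at 3 in every coordinate
```

Only fitness evaluations are needed, never backpropagation, which is what makes the method embarrassingly parallel and applicable to non-differentiable objectives.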
Stochastic Optimization for Large-scale Optimal Transport
A new class of stochastic optimization algorithms is proposed to cope with large-scale problems routinely encountered in machine learning applications, based on entropic regularization of the primal OT problem, which results in a smooth dual optimization problem that can be addressed with algorithms having provably faster convergence.
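The entropic regularization underlying this line of work is easy to sketch with the classical Sinkhorn iterations, which solve min_P ⟨C, P⟩ − ε·H(P) subject to the marginal constraints by alternating dual scaling updates. The cost matrix and histograms below are an illustrative toy, not the paper's stochastic solver:

```python
import numpy as np

# Sinkhorn iterations for entropy-regularized OT between histograms a and b
# with cost matrix C: the plan has the form P = diag(u) K diag(v), K = exp(-C/eps),
# and u, v are found by alternately matching each marginal.
def sinkhorn(C, a, b, eps=0.1, iters=500):
    K = np.exp(-C / eps)  # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(iters):
        v = b / (K.T @ u)  # match column marginals
        u = a / (K @ v)    # match row marginals
    return u[:, None] * K * v[None, :]

# Toy problem: two uniform 3-point histograms with squared-distance cost.
x = np.array([0.0, 1.0, 2.0])
C = (x[:, None] - x[None, :]) ** 2
a = b = np.full(3, 1.0 / 3.0)
P = sinkhorn(C, a, b)
print(P.sum(axis=0), P.sum(axis=1))  # both marginals ~= [1/3, 1/3, 1/3]
```

The smoothing by ε is exactly what turns the non-smooth OT dual into a smooth objective amenable to the stochastic gradient methods the summary describes; the paper's contribution is solving that dual at scale, not the plain iterations shown here.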