Corpus ID: 226254184

Harnessing Distribution Ratio Estimators for Learning Agents with Quality and Diversity

Authors: Tanmay Gangwani, Jian Peng, Yuanshuo Zhou
Quality-Diversity (QD) is a concept from Neuroevolution with some intriguing applications to Reinforcement Learning. It facilitates learning a population of agents where each member is optimized to simultaneously accumulate high task-returns and exhibit behavioral diversity compared to other members. In this paper, we build on a recent kernel-based method for training a QD policy ensemble with Stein variational gradient descent. With kernels based on $f$-divergence between the stationary… 
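The Stein variational gradient descent (SVGD) update that such kernel-based ensemble methods build on can be sketched as follows. This is a minimal NumPy illustration on a toy 1-D Gaussian target, not the paper's algorithm; `svgd_step`, the step size, and the median-heuristic bandwidth are illustrative choices:

```python
import numpy as np

def svgd_step(theta, grad_logp, eps=0.2):
    """One Stein variational gradient descent update on particles theta (n, d)."""
    n = theta.shape[0]
    diffs = theta[:, None, :] - theta[None, :, :]       # diffs[j, i] = theta_j - theta_i
    sq = np.sum(diffs ** 2, axis=-1)
    h = np.median(sq) / np.log(n + 1) + 1e-8            # RBF bandwidth (median heuristic)
    K = np.exp(-sq / h)                                 # K[j, i] = k(theta_j, theta_i)
    attract = K @ grad_logp(theta) / n                  # kernel-weighted log-density gradients
    repulse = (-2.0 / h) * np.sum(K[:, :, None] * diffs, axis=0) / n  # pushes particles apart
    return theta + eps * (attract + repulse)

# toy demo: particles initialized near 5.0 are transported toward a standard normal
rng = np.random.default_rng(0)
theta = rng.normal(5.0, 1.0, size=(50, 1))
for _ in range(2000):
    theta = svgd_step(theta, lambda th: -th)   # grad log N(0, 1) = -theta
```

The repulsive term is what produces diversity: wherever the kernel says two particles are too similar, it pushes them apart, while the attractive term pulls the ensemble toward high-density (high-return) regions.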


Discovering Diverse Nearly Optimal Policies with Successor Features
Diverse Successive Policies is proposed, a method for discovering policies that are diverse in the space of Successor Features while remaining near-optimal with respect to the extrinsic reward of the MDP.
Modelling Behavioural Diversity for Learning in Open-Ended Games
By incorporating the diversity metric into best-response dynamics, this work develops diverse fictitious play and a diverse policy-space response oracle for solving normal-form and open-ended games, and proves the uniqueness of the diverse best response and the convergence of the algorithms on two-player games.
Policy gradient assisted MAP-Elites
PGA-MAP-Elites is presented, a novel algorithm that enables MAP-Elites to efficiently evolve large neural network controllers by introducing a gradient-based variation operator inspired by Deep Reinforcement Learning.
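For orientation, the archive loop that MAP-Elites variants share can be sketched as below; this toy example is not PGA-MAP-Elites itself, and `evaluate`, `mutate`, and `random_solution` are placeholder callables:

```python
import random

def map_elites(evaluate, mutate, random_solution, n_init=100, n_iters=2000):
    """Minimal MAP-Elites loop: one elite (best-fitness solution) kept per behaviour cell."""
    archive = {}  # behaviour-descriptor cell -> (fitness, solution)

    def try_insert(x):
        fitness, cell = evaluate(x)
        if cell not in archive or fitness > archive[cell][0]:
            archive[cell] = (fitness, x)

    for _ in range(n_init):                      # seed the archive with random solutions
        try_insert(random_solution())
    for _ in range(n_iters):                     # mutate a random elite, keep improvements
        _, parent = random.choice(list(archive.values()))
        try_insert(mutate(parent))
    return archive

# toy demo: maximise -(x - 0.5)^2 on [0, 1]; behaviour descriptor = decile of x
random.seed(0)
evaluate = lambda x: (-(x - 0.5) ** 2, min(int(x * 10), 9))
archive = map_elites(evaluate,
                     mutate=lambda x: min(max(x + random.gauss(0, 0.1), 0.0), 1.0),
                     random_solution=random.random)
```

PGA-MAP-Elites replaces the purely random `mutate` with a policy-gradient variation operator, which is what makes large neural controllers tractable.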
Learning Self-Imitating Diverse Policies
A self-imitation learning algorithm that explores and exploits well in sparse, episodic-reward settings, and that reduces to a policy-gradient algorithm with shaped rewards learned from experience replay, using Stein variational policy gradient with the Jensen-Shannon divergence.
Collaborative Evolutionary Reinforcement Learning
Collaborative Evolutionary Reinforcement Learning (CERL) is introduced, a scalable framework that comprises a portfolio of policies that simultaneously explore and exploit diverse regions of the solution space and significantly outperforms its composite learners while remaining overall more sample-efficient.
DualDICE: Behavior-Agnostic Estimation of Discounted Stationary Distribution Corrections
This work proposes an algorithm, DualDICE, that is agnostic to knowledge of the behavior policy (or policies) used to generate the dataset and improves accuracy compared to existing techniques.
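As a sanity check on what such estimators target, the discounted stationary distribution correction d^pi/d^D can be computed exactly in a toy tabular MDP. This is a hand-written sketch of the quantity being estimated, not the DualDICE algorithm; all numbers are illustrative:

```python
import numpy as np

gamma = 0.9
# illustrative 2-state, 2-action MDP
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])    # P[s, a, s']
r = np.array([[1.0, 0.0], [0.0, 2.0]])      # r[s, a]
mu0 = np.array([1.0, 0.0])                  # initial state distribution

pi  = np.array([[0.8, 0.2], [0.3, 0.7]])    # target policy pi(a|s)
beh = np.array([[0.5, 0.5], [0.5, 0.5]])    # behavior policy that generated the data

def occupancy(policy):
    """Discounted state-action occupancy d(s, a), computed exactly."""
    P_pol = np.einsum('sa,sat->st', policy, P)   # state-to-state transitions under policy
    d_state = (1 - gamma) * np.linalg.solve(np.eye(2) - gamma * P_pol.T, mu0)
    return d_state[:, None] * policy

d_pi, d_beh = occupancy(pi), occupancy(beh)
w = d_pi / d_beh      # the stationary distribution correction DualDICE estimates

# reweighting behavior-distribution rewards by w recovers the target policy's value
ope_estimate = np.sum(d_beh * w * r)
true_value = np.sum(d_pi * r)   # equals (1 - gamma) * expected discounted return of pi
```

DualDICE's contribution is estimating `w` from logged transitions alone, without knowing `beh` or the transition model; the identity checked here is why that suffices for off-policy evaluation.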
Imitation Learning via Off-Policy Distribution Matching
This work shows how the original distribution ratio estimation objective may be transformed in a principled manner to yield a completely off-policy objective and calls the resulting algorithm ValueDICE, finding that it can achieve state-of-the-art sample efficiency and performance.
Diversity-Driven Exploration Strategy for Deep Reinforcement Learning
Simply adding a distance measure to the loss function significantly enhances an agent's exploratory behavior, preventing the policy from being trapped in local optima; an adaptive scaling method is also proposed to stabilize the learning process.
Improving Exploration in Evolution Strategies for Deep Reinforcement Learning via a Population of Novelty-Seeking Agents
This paper shows that algorithms that have been invented to promote directed exploration in small-scale evolved neural networks via populations of exploring agents, specifically novelty search and quality diversity algorithms, can be hybridized with ES to improve its performance on sparse or deceptive deep RL tasks, while retaining scalability.
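The novelty score these population-based methods rely on is typically the mean distance from an agent's behaviour descriptor to its k nearest neighbours in an archive of previously seen behaviours; a minimal sketch (the `novelty` helper and the toy archive are illustrative):

```python
import numpy as np

def novelty(bd, archive, k=3):
    """Novelty score used by novelty search: mean Euclidean distance from a
    behaviour descriptor bd to its k nearest neighbours in the archive."""
    dists = np.sort(np.linalg.norm(np.asarray(archive) - np.asarray(bd), axis=1))
    return float(np.mean(dists[:k]))

# agents whose behaviour lands far from everything seen so far score higher
archive = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
near = novelty([0.1, 0.1], archive)
far = novelty([5.0, 5.0], archive)
```

Hybridizing with ES amounts to using this score (or a weighted mix of novelty and fitness) in place of raw fitness when ranking perturbations.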
Data-Efficient Off-Policy Policy Evaluation for Reinforcement Learning
A new method for predicting the performance of a reinforcement learning policy given historical data that may have been generated by a different policy, based on an extension of the doubly robust estimator and a new way of mixing model-based and importance-sampling-based estimates.
Stein Variational Policy Gradient
A novel Stein variational policy gradient method (SVPG) which combines existing policy gradient methods and a repulsive functional to generate a set of diverse but well-behaved policies is proposed.
Evolutionary Reinforcement Learning for Sample-Efficient Multiagent Coordination
This paper introduces MERL (Multiagent Evolutionary RL), a hybrid algorithm that does not require an explicit alignment between local and global objectives and uses fast, policy-gradient-based learning for each agent by utilizing their dense local rewards.
High-Dimensional Continuous Control Using Generalized Advantage Estimation
This work addresses the large number of samples typically required, and the difficulty of obtaining stable and steady improvement despite the nonstationarity of the incoming data, by using value functions to substantially reduce the variance of policy gradient estimates at the cost of some bias.
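The estimator admits a short backward recursion over the trajectory; a minimal sketch (function name and toy inputs are illustrative):

```python
import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation via the backward recursion
    A_t = delta_t + gamma * lam * A_{t+1}, with the TD residual
    delta_t = r_t + gamma * V(s_{t+1}) - V(s_t).
    `values` must have length len(rewards) + 1 (bootstrap value for the final state)."""
    T = len(rewards)
    adv = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        adv[t] = running
    return adv
```

Setting `lam=0` recovers the one-step TD residual (low variance, high bias), while `lam=1` recovers the discounted return minus the value baseline (high variance, low bias); intermediate values trade between the two.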