# One Solution is Not All You Need: Few-Shot Extrapolation via Structured MaxEnt RL

```bibtex
@article{Kumar2020OneSI,
  title   = {One Solution is Not All You Need: Few-Shot Extrapolation via Structured MaxEnt RL},
  author  = {Saurabh Kumar and Aviral Kumar and Sergey Levine and Chelsea Finn},
  journal = {ArXiv},
  year    = {2020},
  volume  = {abs/2010.14484}
}
```

While reinforcement learning algorithms can learn effective policies for complex tasks, these policies are often brittle to even minor task variations, especially when variations are not explicitly provided during training. One natural approach to this problem is to train agents with manually specified variation in the training task or environment. However, this may be infeasible in practical situations, either because making perturbations is not possible, or because it is unclear how to choose…
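The abstract describes learning a set of diverse near-optimal policies rather than a single one. A minimal sketch of the core reward shaping such an approach might use, with hypothetical names and parameters (not the authors' code): a per-latent diversity bonus, e.g. the log-likelihood of the latent skill under a learned discriminator, is added to the environment reward only when the policy's return stays within a tolerance of the best-known return, so diversity never trades off against near-optimality.

```python
def diversity_shaped_reward(env_reward, disc_logprob_z, best_return, env_return,
                            alpha=1.0, eps=0.1):
    """Hedged sketch of a structured-MaxEnt-style reward.

    env_reward     -- reward from the environment at this step
    disc_logprob_z -- log q(z | s) from a skill discriminator (hypothetical)
    best_return    -- best task return observed so far (assumed positive here)
    env_return     -- return of the current trajectory
    alpha, eps     -- bonus weight and near-optimality tolerance (assumptions)
    """
    # Add the diversity bonus only for trajectories whose return is within
    # an eps-band of the best return; otherwise optimize the task reward alone.
    if env_return >= (1.0 - eps) * best_return:
        return env_reward + alpha * disc_logprob_z
    return env_reward
```

The gating condition is the key design choice: it lets the agent explore distinct solutions without sacrificing task performance.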

## 15 Citations

Learning a subspace of policies for online adaptation in Reinforcement Learning

- Computer Science · ArXiv
- 2021

This article considers the simplest yet hardest-to-tackle generalization setting, in which the test environment is unknown at train time, forcing the agent to adapt to the system's new dynamics, and proposes an approach in which a subspace of policies is learned within the parameter space.

Trajectory Diversity for Zero-Shot Coordination

- Computer Science · AAMAS
- 2021

This work introduces Trajectory Diversity (TrajeDi), a differentiable objective for generating diverse reinforcement learning policies, derives TrajeDi as a generalization of the Jensen-Shannon divergence between policies, and motivates it experimentally in two simple settings.

Deep Reinforcement Learning amidst Continual Structured Non-Stationarity

- Computer Science · ICML
- 2021

This work leverages latent variable models to learn a representation of the environment from current and past experiences, performs off-policy RL with this representation, and empirically finds that this approach substantially outperforms approaches that do not reason about environment shift.

Learning more skills through optimistic exploration

- Computer Science, Mathematics · ArXiv
- 2021

It is demonstrated empirically that DISDAIN improves skill learning both in a tabular grid world (Four Rooms) and in the 57 games of the Atari Suite (from pixels), and readers are encouraged to treat pessimism with DISDAIN.

Adaptable Agent Populations via a Generative Model of Policies

- Computer Science · ArXiv
- 2021

A generative model of policies for reinforcement learning is introduced, which maps a low-dimensional latent space to an agent policy space and enables learning an entire population of agent policies, without requiring the use of separate policy parameters.

A Simple Approach to Continual Learning by Transferring Skill Parameters

- Computer Science · ArXiv
- 2021

It is shown how to continually acquire robotic manipulation skills without forgetting, and using far fewer samples than needed to train them from scratch, given an appropriate curriculum.

Dynamics-Aware Quality-Diversity for Efficient Learning of Skill Repertoires

- Computer Science · ArXiv
- 2021

Dynamics-Aware Quality-Diversity (DA-QD), a framework to improve the sample efficiency of QD algorithms through the use of dynamics models, is proposed and shown how it can be used for continual acquisition of new skill repertoires.

Motion Planning by Learning the Solution Manifold in Trajectory Optimization

- Computer Science · ArXiv
- 2021

The approach can be interpreted as training a deep generative model of collision-free trajectories for motion planning, and the experimental results indicate that the trained model represents an infinite set of homotopic solutions for motion planning problems.

Unpacking the Expressed Consequences of AI Research in Broader Impact Statements

- Computer Science · AIES
- 2021

A qualitative thematic analysis of a sample of statements written for the NeurIPS 2020 conference identifies themes related to how consequences are expressed, areas of impacts expressed, and researchers' recommendations for mitigating negative consequences in the future.

Discovering Diverse Nearly Optimal Policies with Successor Features

- Computer Science, Mathematics · ArXiv
- 2021

Diverse Successive Policies, a method for discovering policies that are diverse in the space of Successor Features while remaining near-optimal with respect to the extrinsic reward of the MDP, is proposed.

## References

Showing 1–10 of 48 references

Generalization in Reinforcement Learning with Selective Noise Injection and Information Bottleneck

- Computer Science, Mathematics · NeurIPS
- 2019

This work proposes Selective Noise Injection (SNI), which maintains the regularizing effect of the injected noise while mitigating its adverse effects on gradient quality, and demonstrates that the Information Bottleneck is a particularly well-suited regularization technique for RL, as it is effective in the low-data regime encountered early in training RL agents.

Generalization and Regularization in DQN

- Mathematics, Computer Science · ArXiv
- 2018

Despite regularization being largely underutilized in deep RL, it is shown that it can, in fact, help DQN learn more general features, which can then be reused and fine-tuned on similar tasks, considerably improving the sample efficiency of DQN.

Meta-Reinforcement Learning of Structured Exploration Strategies

- Computer Science, Mathematics · NeurIPS
- 2018

This work introduces a novel gradient-based fast adaptation algorithm -- model agnostic exploration with structured noise (MAESN) -- to learn exploration strategies from prior experience that are informed by prior knowledge and are more effective than random action-space noise.

Learning to Adapt in Dynamic, Real-World Environments through Meta-Reinforcement Learning

- Computer Science, Mathematics · ICLR
- 2019

This work uses meta-learning to train a dynamics model prior such that, when combined with recent data, this prior can be rapidly adapted to the local context and demonstrates the importance of incorporating online adaptation into autonomous agents that operate in the real world.

Diversity is All You Need: Learning Skills without a Reward Function

- Computer Science · ICLR
- 2019

This work proposes DIAYN ("Diversity is All You Need"), a method for learning useful skills without a reward function, which learns skills by maximizing an information-theoretic objective using a maximum entropy policy.
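The information-theoretic objective mentioned above can be written out, following the DIAYN paper's decomposition and variational lower bound, where $q_\phi(z \mid s)$ is a learned skill discriminator and $p(z)$ a fixed prior over skills:

```latex
\mathcal{F}(\theta) = I(S; Z) + \mathcal{H}[A \mid S] - I(A; Z \mid S)
\;\geq\; \mathbb{E}_{z \sim p(z),\, s \sim \pi}\!\big[\log q_\phi(z \mid s) - \log p(z)\big] + \mathcal{H}[A \mid S, Z]
```

Intuitively, skills should be distinguishable from states (high $I(S;Z)$) while acting as randomly as possible given the state and skill (maximum-entropy term).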

Worst Cases Policy Gradients

- Computer Science, Mathematics · CoRL
- 2019

This work proposes an actor-critic framework that models the uncertainty of the future and simultaneously learns a policy based on that uncertainty model, and optimizes policies for varying levels of conditional Value-at-Risk.

Dynamics-Aware Unsupervised Discovery of Skills

- Computer Science, Mathematics · ICLR
- 2020

This work proposes an unsupervised learning algorithm, Dynamics-Aware Discovery of Skills (DADS), which simultaneously discovers predictable behaviors and learns their dynamics, and demonstrates that zero-shot planning in the learned latent space significantly outperforms standard MBRL and model-free goal-conditioned RL, and substantially improves over prior hierarchical RL methods for unsupervised skill discovery.

Quantifying Generalization in Reinforcement Learning

- Computer Science, Mathematics · ICML
- 2019

It is shown that deeper convolutional architectures improve generalization, as do methods traditionally found in supervised learning, including L2 regularization, dropout, data augmentation and batch normalization.

Action Robust Reinforcement Learning and Applications in Continuous Control

- Computer Science, Mathematics · ICML
- 2019

This work formalizes two new criteria of robustness to action uncertainty, suggests algorithms for the tabular case, generalizes the approach to deep reinforcement learning (DRL), and provides extensive experiments in various MuJoCo domains.

Reinforcement Learning with Perturbed Rewards

- Computer Science, Mathematics · AAAI
- 2020

This work develops a robust RL framework that enables agents to learn in noisy environments where only perturbed rewards are observed, and shows that trained policies based on the estimated surrogate reward achieve higher expected rewards and converge faster than existing baselines.