Corpus ID: 221586337

Importance Weighted Policy Learning and Adaption

@article{Galashov2020ImportanceWP,
  title={Importance Weighted Policy Learning and Adaption},
  author={Alexandre Galashov and Jakub Sygnowski and Guillaume Desjardins and Jan Humplik and Leonard Hasenclever and Rae Jeong and Yee Whye Teh and Nicolas Manfred Otto Heess},
  journal={ArXiv},
  year={2020},
  volume={abs/2009.04875}
}
The ability to exploit prior experience to solve novel problems rapidly is a hallmark of biological learning systems and of great practical importance for artificial ones. In the meta reinforcement learning literature much recent work has focused on the problem of optimizing the learning process itself. In this paper we study a complementary approach which is conceptually simple, general, modular and built on top of recent improvements in off-policy learning. The framework is inspired by ideas… 
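
As a rough illustration of the importance-weighting idea referenced in the title (a sketch, not necessarily the paper's exact objective), experience collected under a behaviour policy \mu can be reused to evaluate or improve a new policy \pi by reweighting sampled actions:

\mathbb{E}_{a \sim \pi(\cdot \mid s)}\big[Q(s,a)\big] \;\approx\; \sum_{i} w_i \, Q(s, a_i), \qquad w_i \propto \frac{\pi(a_i \mid s)}{\mu(a_i \mid s)}, \quad a_i \sim \mu(\cdot \mid s).

In a transfer or meta-learning setting, \mu would correspond to behaviour gathered on previous tasks, so that prior experience can inform learning and adaptation on a new task.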
2 Citations


How to Spend Your Robot Time: Bridging Kickstarting and Offline Reinforcement Learning for Vision-based Robotic Manipulation
TLDR: This work develops two RL algorithms that can speed up training by using not only the action distributions of teacher policies but also the data collected by such policies on the task at hand, and investigates ways to minimize online interactions in a target task by reusing a suboptimal policy.
Collect & Infer - a fresh look at data-efficient Reinforcement Learning
TLDR: This position paper proposes a fresh look at reinforcement learning from the perspective of data efficiency, explicitly modelling RL as two separate but interconnected processes concerned with data collection and knowledge inference respectively, and argues that data efficiency can only be achieved through careful consideration of both aspects.

References

Showing 1-10 of 46 references
Efficient Off-Policy Meta-Reinforcement Learning via Probabilistic Context Variables
TLDR: This paper develops an off-policy meta-RL algorithm that disentangles task inference and control, and performs online probabilistic filtering of latent task variables to infer how to solve a new task from small amounts of experience.
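
Schematically, under the usual PEARL formulation (notation here is illustrative), a permutation-invariant encoder produces a posterior over a latent task variable z from context c (transitions observed on the current task), and the policy is conditioned on samples of z:

q_\phi(z \mid c_{1:N}) \;\propto\; \prod_{n=1}^{N} \Psi_\phi(z \mid c_n), \qquad z \sim q_\phi(z \mid c_{1:N}), \quad a \sim \pi_\theta(a \mid s, z).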
Relative Entropy Regularized Policy Iteration
TLDR: An off-policy actor-critic algorithm for reinforcement learning (RL) that combines ideas from gradient-free optimization via stochastic search with a learned action-value function, and can be seen either as an extension of the Maximum a Posteriori Policy Optimisation (MPO) algorithm or as an addition to a policy iteration scheme.
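
As a sketch of the structure of this family of methods (not a verbatim restatement of the paper), each improvement step first forms a non-parametric improved policy under a KL constraint and then distils it back into the parametric policy:

q(a \mid s) \;\propto\; \pi_{\text{old}}(a \mid s)\, \exp\!\big(Q(s,a)/\eta\big) \;\; \text{s.t.} \;\; \mathrm{KL}\big(q \,\|\, \pi_{\text{old}}\big) \le \epsilon, \qquad \pi_{\text{new}} = \arg\max_{\pi} \; \mathbb{E}_{s,\, a \sim q}\big[\log \pi(a \mid s)\big].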
Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor
TLDR: This paper proposes soft actor-critic, an off-policy actor-critic deep RL algorithm based on the maximum entropy reinforcement learning framework, which achieves state-of-the-art performance on a range of continuous control benchmark tasks, outperforming prior on-policy and off-policy methods.
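
The underlying maximum-entropy objective, in its standard form, augments the expected return with an entropy bonus weighted by a temperature \alpha:

J(\pi) \;=\; \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\Big[\, r(s_t, a_t) + \alpha\, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \Big].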
RL²: Fast Reinforcement Learning via Slow Reinforcement Learning, 2016
Information asymmetry in KL-regularized RL
TLDR: This work starts from the KL-regularized expected reward objective and introduces an additional component, a default policy, but crucially restricts the amount of information the default policy receives, forcing it to learn reusable behaviors that help the policy learn faster.
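
Schematically, the KL-regularized objective with an information-asymmetric default policy \pi_0 can be written as follows (a sketch; x_t denotes the agent's full observation and x_t^D the restricted subset available to the default policy):

\mathbb{E}_{\pi}\Big[ \sum_{t} \gamma^{t} \Big( r_t \;-\; \alpha\, \mathrm{KL}\big(\pi(\cdot \mid x_t)\,\big\|\,\pi_0(\cdot \mid x_t^{D})\big) \Big) \Big].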
Distral: Robust multitask reinforcement learning
TLDR: This work proposes a new approach for the joint training of multiple tasks, referred to as Distral (Distill & Transfer Learning), and shows that the proposed learning process is more robust and more stable, attributes that are critical in deep reinforcement learning.
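
Conceptually (a schematic form rather than the exact parameterization used in the paper), each task policy \pi_i is trained on its own reward while being regularized toward a shared distilled policy \pi_0, with an additional entropy term:

\max_{\pi_0,\, \{\pi_i\}} \; \sum_{i} \mathbb{E}_{\pi_i}\Big[ \sum_{t} \gamma^{t} \Big( r_i(s_t, a_t) \;-\; \alpha\, \mathrm{KL}\big(\pi_i(\cdot \mid s_t)\,\big\|\,\pi_0(\cdot \mid s_t)\big) \;+\; \beta\, \mathcal{H}\big(\pi_i(\cdot \mid s_t)\big) \Big) \Big].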
Compositional Transfer in Hierarchical Reinforcement Learning
TLDR: Regularized Hierarchical Policy Optimization (RHPO) is introduced to improve data efficiency for domains with multiple dominant tasks and ultimately reduce required platform time, and demonstrates substantial data-efficiency and final-performance gains over competitive baselines in a week-long physical robot stacking experiment.
Q-Learning in enormous action spaces via amortized approximate maximization
TLDR: The resulting approach, dubbed Amortized Q-learning (AQL), is able to handle discrete, continuous, or hybrid action spaces while maintaining the benefits of Q-learning.
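
The amortization can be sketched as replacing the exact argmax over a large or continuous action space with a maximization over a small set of actions sampled from a learned proposal \mu_\phi (notation illustrative):

a^{\star}(s) \;\approx\; \arg\max_{a \in \{a_1, \dots, a_K\}} Q_\theta(s, a), \qquad a_k \sim \mu_\phi(\cdot \mid s).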
VariBAD: A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning
TLDR: This paper introduces variational Bayes-Adaptive Deep RL (variBAD), a way to meta-learn to perform approximate inference in an unknown environment and to incorporate task uncertainty directly during action selection, and achieves higher online return than existing methods.
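
At a high level (a sketch, not the paper's exact ELBO), variBAD trains a variational encoder to approximate the belief over the task m given the trajectory so far, and conditions the policy on that belief:

q_\phi(m \mid \tau_{:t}) \;\approx\; p(m \mid \tau_{:t}), \qquad a_t \sim \pi_\theta\big(a \mid s_t,\, q_\phi(m \mid \tau_{:t})\big),

with the encoder and a reward/transition decoder trained via an ELBO and the policy trained to maximize return given the belief.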
Meta-Q-Learning
TLDR: Experiments on standard continuous-control benchmarks suggest that MQL compares favorably with the state of the art in meta-RL; it does so by drawing on ideas from propensity estimation, thereby amplifying the amount of data available for adaptation.
...