
CoMPS: Continual Meta Policy Search

@article{Berseth2021CoMPSCM,
  title={CoMPS: Continual Meta Policy Search},
  author={Glen Berseth and Zhiwei Zhang and Grace H. Zhang and Chelsea Finn and Sergey Levine},
  journal={ArXiv},
  year={2021},
  volume={abs/2112.04467}
}
…, analogously to PPO and other importance-sampled policy gradient algorithms. We use this estimator for the inner-loop update in Algorithm 1, line 5. Our ablation experiments show that this approach is needed to enable successful meta-training from the exhaustive off-policy experience collected by CoMPS.
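For concreteness, below is a minimal PyTorch-style sketch of one clipped importance-sampled policy gradient step of this kind. The policy.log_prob interface, the argument names, and the fixed clipping constant clip_eps are illustrative assumptions, not the exact estimator from Algorithm 1.

import torch

def clipped_is_policy_gradient_step(policy, optimizer, obs, actions,
                                    advantages, behavior_log_probs,
                                    clip_eps=0.2):
    # One importance-sampled policy gradient step in the PPO style.
    # policy.log_prob(obs, actions) is an assumed interface returning
    # log pi_theta(a|s) for each (s, a) pair in the batch.
    new_log_probs = policy.log_prob(obs, actions)
    # Importance weight: pi_theta(a|s) / pi_behavior(a|s).
    ratio = torch.exp(new_log_probs - behavior_log_probs)
    unclipped = ratio * advantages
    # Clipping keeps the update stable when the behavior policy that
    # collected the data differs substantially from the current policy.
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Pessimistic (lower-bound) surrogate objective, negated for gradient descent.
    loss = -torch.min(unclipped, clipped).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

Clipping the importance ratio bounds the influence of samples collected under behavior policies far from the current one, which is what makes reusing large amounts of off-policy experience stable.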
1 Citation
Task-Agnostic Continual Reinforcement Learning: In Praise of a Simple Baseline
TLDR
By analyzing different training statistics, including gradient conflict, the paper lays out evidence that 3RL's outperformance stems from its ability to quickly infer how new tasks relate to previous ones, enabling forward transfer.

References

Showing 1–10 of 74 references
Proximal Policy Optimization Algorithms
We propose a new family of policy gradient methods for reinforcement learning, which alternate between sampling data through interaction with the environment, and optimizing a "surrogate" objective function using stochastic gradient ascent.
Evolved Policy Gradients
TLDR
Empirical results show that the evolved policy gradient algorithm (EPG) learns faster on several randomized environments than an off-the-shelf policy gradient method, and that its learned loss generalizes to out-of-distribution test-time tasks and exhibits qualitatively different behavior from other popular meta-learning algorithms.
ProMP: Proximal Meta-Policy Search
TLDR
A novel meta-learning algorithm is developed that overcomes both the issue of poor credit assignment and previous difficulties in estimating meta-policy gradients; it leads to superior pre-adaptation policy behavior and consistently outperforms previous meta-RL algorithms in sample efficiency, wall-clock time, and asymptotic performance.
Guided Meta-Policy Search
TLDR
This paper proposes to learn a reinforcement learning procedure through imitation of expert policies that solve previously seen tasks, demonstrating significant improvements in meta-RL sample efficiency over prior work as well as the ability to scale to domains with visual observations.
Efficient Off-Policy Meta-Reinforcement Learning via Probabilistic Context Variables
TLDR
This paper develops an off-policy meta-RL algorithm that disentangles task inference and control and performs online probabilistic filtering of latent task variables to infer how to solve a new task from small amounts of experience.
Some Considerations on Learning to Explore via Meta-Reinforcement Learning
TLDR
E-MAML and E-RL² deliver better performance on tasks where exploration is important; they are evaluated on a novel environment called 'Krazy World' and a set of maze environments.
Stabilizing Off-Policy Q-Learning via Bootstrapping Error Reduction
TLDR
A practical algorithm, bootstrapping error accumulation reduction (BEAR), is proposed, and it is demonstrated that BEAR can learn robustly from different off-policy distributions, including random and suboptimal demonstrations, on a range of continuous control tasks.
VariBAD: A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning
TLDR
This paper introduces variational Bayes-Adaptive Deep RL (variBAD), a way to meta-learn to perform approximate inference in an unknown environment and to incorporate task uncertainty directly during action selection; it achieves higher online return than existing methods.
Adaptive Gradient-Based Meta-Learning Methods
TLDR
This approach enables the task similarity to be learned adaptively, provides sharper transfer-risk bounds in the setting of statistical learning-to-learn, and leads to straightforward derivations of average-case regret bounds for efficient algorithms in settings where the task environment changes dynamically or the tasks share a certain geometric structure.
Offline Meta Reinforcement Learning
TLDR
A Bayesian RL (BRL) view is taken, the recently proposed VariBAD BRL algorithm is extended to the off-policy setting, and learning of Bayes-optimal exploration strategies from offline data with deep neural networks is demonstrated.
…