Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks
- Chelsea Finn, P. Abbeel, S. Levine
- International Conference on Machine Learning
- 9 March 2017
We propose an algorithm for meta-learning that is model-agnostic, in the sense that it is compatible with any model trained with gradient descent and applicable to a variety of different learning…
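A compact restatement of the gradient-based meta-update the abstract alludes to, in the notation used by the paper (θ: shared initialization, α and β: inner and outer step sizes, L_{T_i}: loss of task T_i sampled from p(T)); the outer step differentiates through the inner, per-task adaptation:

```latex
\theta_i' = \theta - \alpha \nabla_\theta \mathcal{L}_{\mathcal{T}_i}(f_\theta),
\qquad
\theta \leftarrow \theta - \beta \nabla_\theta \sum_{\mathcal{T}_i \sim p(\mathcal{T})} \mathcal{L}_{\mathcal{T}_i}(f_{\theta_i'})
```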
Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor
- Tuomas Haarnoja, Aurick Zhou, P. Abbeel, S. Levine
- International Conference on Machine Learning
- 4 January 2018
This paper proposes soft actor-critic, an off-policy actor-critic deep RL algorithm based on the maximum entropy reinforcement learning framework, and achieves state-of-the-art performance on a range of continuous control benchmark tasks, outperforming prior on-policy and off-policy methods.
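For context, the maximum entropy objective the abstract refers to augments expected return with a policy-entropy bonus weighted by a temperature α:

```latex
J(\pi) = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\!\left[ r(s_t, a_t) + \alpha\, \mathcal{H}\!\left( \pi(\cdot \mid s_t) \right) \right]
```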
Trust Region Policy Optimization
- J. Schulman, S. Levine, P. Abbeel, Michael I. Jordan, Philipp Moritz
- International Conference on Machine Learning
- 19 February 2015
An iterative procedure for optimizing control policies with guaranteed monotonic improvement; by making several approximations to the theoretically justified scheme, a practical algorithm, called Trust Region Policy Optimization (TRPO), is obtained.
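The trust-region step the summary describes can be written as a KL-constrained surrogate maximization (δ is the trust-region radius; the practical algorithm approximates this constrained problem):

```latex
\max_{\theta}\; \mathbb{E}_{s, a \sim \pi_{\theta_{\text{old}}}}\!\left[ \frac{\pi_\theta(a \mid s)}{\pi_{\theta_{\text{old}}}(a \mid s)}\, A_{\theta_{\text{old}}}(s, a) \right]
\quad \text{s.t.} \quad
\mathbb{E}_{s}\!\left[ D_{\mathrm{KL}}\!\left( \pi_{\theta_{\text{old}}}(\cdot \mid s)\, \|\, \pi_\theta(\cdot \mid s) \right) \right] \le \delta
```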
High-Dimensional Continuous Control Using Generalized Advantage Estimation
- J. Schulman, Philipp Moritz, S. Levine, Michael I. Jordan, P. Abbeel
- International Conference on Learning Representations
- 8 June 2015
This work addresses the large number of samples typically required and the difficulty of obtaining stable and steady improvement despite the nonstationarity of the incoming data by using value functions to substantially reduce the variance of policy gradient estimates at the cost of some bias.
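A minimal sketch of the estimator this entry describes: the GAE(γ, λ) advantage is an exponentially weighted sum of TD residuals, with λ trading variance against bias. The function name and NumPy implementation below are illustrative (no episode-termination masking), not code from the paper:

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """GAE(gamma, lambda) over a single trajectory.

    rewards: shape [T]; values: shape [T + 1] (last entry is the bootstrap value).
    """
    deltas = rewards + gamma * values[1:] - values[:-1]      # TD residuals delta_t
    advantages = np.zeros_like(deltas)
    running = 0.0
    for t in reversed(range(len(rewards))):                  # discounted sum of residuals
        running = deltas[t] + gamma * lam * running
        advantages[t] = running
    return advantages
```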
Soft Actor-Critic Algorithms and Applications
- Tuomas Haarnoja, Aurick Zhou, S. Levine
- arXiv.org
- 13 December 2018
Soft Actor-Critic (SAC), the recently introduced off-policy actor-critic algorithm based on the maximum entropy RL framework, achieves state-of-the-art performance, outperforming prior on-policy and off-policy methods in sample-efficiency and asymptotic performance.
Conservative Q-Learning for Offline Reinforcement Learning
- Aviral Kumar, Aurick Zhou, G. Tucker, S. Levine
- Neural Information Processing Systems
- 8 June 2020
Conservative Q-learning (CQL) is proposed, which aims to address limitations of offline RL methods by learning a conservative Q-function such that the expected value of a policy under this Q-function lower-bounds its true value.
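In one common form, the conservative Q-function described here is trained by adding a regularizer to the standard Bellman error: Q-values under a sampling distribution μ (e.g., the current policy) are pushed down while Q-values on dataset actions are pushed up, which yields the lower bound the summary mentions (notation roughly as in the paper; α is the regularizer weight):

```latex
\min_{Q}\; \alpha \left( \mathbb{E}_{s \sim \mathcal{D},\, a \sim \mu(\cdot \mid s)}\!\left[ Q(s, a) \right]
- \mathbb{E}_{(s, a) \sim \mathcal{D}}\!\left[ Q(s, a) \right] \right)
+ \tfrac{1}{2}\, \mathbb{E}_{(s, a, s') \sim \mathcal{D}}\!\left[ \left( Q(s, a) - \hat{\mathcal{B}}^{\pi} \bar{Q}(s, a) \right)^{2} \right]
```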
D4RL: Datasets for Deep Data-Driven Reinforcement Learning
- Justin Fu, Aviral Kumar, Ofir Nachum, G. Tucker, S. Levine
- arXiv.org
- 15 April 2020
This work introduces benchmarks specifically designed for the offline setting, guided by key properties of datasets relevant to real-world applications of offline RL, and releases benchmark tasks and datasets with a comprehensive evaluation of existing algorithms and an evaluation protocol together with an open-source codebase.
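A minimal usage sketch, assuming the open-source d4rl package; the task id below is an example and the exact environment names depend on the release:

```python
import gym
import d4rl  # importing registers the offline-RL tasks with gym

env = gym.make("halfcheetah-medium-v2")   # example D4RL task id (assumption: available in your install)
dataset = env.get_dataset()               # dict of arrays: 'observations', 'actions', 'rewards', 'terminals', ...
print(dataset["observations"].shape, dataset["actions"].shape)
```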
Deep Reinforcement Learning in a Handful of Trials using Probabilistic Dynamics Models
- Kurtland Chua, R. Calandra, R. McAllister, S. Levine
- Neural Information Processing Systems
- 30 May 2018
This paper proposes a new algorithm called probabilistic ensembles with trajectory sampling (PETS) that combines uncertainty-aware deep network dynamics models with sampling-based uncertainty propagation, which matches the asymptotic performance of model-free algorithms on several challenging benchmark tasks, while requiring significantly fewer samples.
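A heavily simplified sketch of the planning loop this entry describes: candidate action sequences are scored by rolling them through randomly chosen ensemble members (trajectory sampling), and the first action of the best sequence is executed. The actual method uses probabilistic networks and CEM-based action optimization; the function and argument names below are illustrative:

```python
import numpy as np

def plan_action(state, dynamics_ensemble, reward_fn, horizon=15, n_candidates=500, action_dim=6):
    """Random-shooting MPC with a model ensemble (simplified PETS-style planner).

    dynamics_ensemble: list of callables f(state, action) -> next_state
    reward_fn:         callable r(state, action) -> float
    """
    best_return, best_first_action = -np.inf, None
    for _ in range(n_candidates):
        actions = np.random.uniform(-1.0, 1.0, size=(horizon, action_dim))
        s, total = np.asarray(state, dtype=np.float64), 0.0
        for a in actions:
            model = dynamics_ensemble[np.random.randint(len(dynamics_ensemble))]  # random member per step
            total += reward_fn(s, a)
            s = model(s, a)
        if total > best_return:
            best_return, best_first_action = total, actions[0]
    return best_first_action
```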
Reinforcement Learning with Deep Energy-Based Policies
- Tuomas Haarnoja, Haoran Tang, P. Abbeel, S. Levine
- International Conference on Machine Learning
- 27 February 2017
A method is proposed for learning expressive energy-based policies for continuous states and actions, which has previously been feasible only in tabular domains, via a new algorithm, called soft Q-learning, that expresses the optimal policy through a Boltzmann distribution.
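The Boltzmann-distribution policy the summary refers to, together with the soft value function that normalizes it (α is the temperature; notation as in the maximum entropy framework):

```latex
\pi^{*}(a \mid s) \propto \exp\!\left( \tfrac{1}{\alpha} Q_{\mathrm{soft}}(s, a) \right),
\qquad
V_{\mathrm{soft}}(s) = \alpha \log \int \exp\!\left( \tfrac{1}{\alpha} Q_{\mathrm{soft}}(s, a) \right) \mathrm{d}a
```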
Learning Robust Rewards with Adversarial Inverse Reinforcement Learning
It is demonstrated that AIRL is able to recover reward functions that are robust to changes in dynamics, enabling us to learn policies even under significant variation in the environment seen during training.
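Roughly, as presented in the paper, the AIRL discriminator is built from a learned function f that, in the dynamics-robust variant, is decomposed into a state-only reward term g and a shaping potential h; this decomposition is what allows the recovered reward to transfer across changes in dynamics:

```latex
D_{\theta,\phi}(s, a, s') = \frac{\exp\!\left( f_{\theta,\phi}(s, a, s') \right)}{\exp\!\left( f_{\theta,\phi}(s, a, s') \right) + \pi(a \mid s)},
\qquad
f_{\theta,\phi}(s, a, s') = g_{\theta}(s) + \gamma\, h_{\phi}(s') - h_{\phi}(s)
```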