Sample-Efficient Reinforcement Learning via Conservative Model-Based Actor-Critic

Zhihai Wang, Jie Wang, Qi Zhou, Bin Li, Houqiang Li
Model-based reinforcement learning algorithms, which aim to learn a model of the environment to make decisions, are more sample efficient than their model-free counterparts. The sample efficiency of model-based approaches relies on whether the model can well approximate the environment. However, learning an accurate model is challenging, especially in complex and noisy environments. To tackle this problem, we propose the conservative model-based actor-critic (CMBAC), a novel approach that… 

Learning Task-relevant Representations for Generalization via Characteristic Functions of Reward Sequence Distributions

Experiments demonstrate that CRESP significantly improves generalization to unseen environments, outperforming several state-of-the-art methods on DeepMind Control tasks with different visual distractions.

Deep Model-Based Reinforcement Learning via Estimated Uncertainty and Conservative Policy Optimization

An upper bound on the uncertainty is derived; based on it, an uncertainty-aware policy optimization algorithm optimizes the policy conservatively to encourage performance improvement with high probability, which can significantly alleviate overfitting of the policy to inaccurate models.

On the model-based stochastic value gradient for continuous reinforcement learning

The proposed method surpasses the asymptotic performance of other model-based methods on the proprioceptive MuJoCo locomotion tasks from the OpenAI Gym, including a humanoid, and achieves these results with a simple deterministic world model without requiring an ensemble.

Model-Based Reinforcement Learning via Meta-Policy Optimization

This work proposes Model-Based Meta-Policy-Optimization (MB-MPO), an approach that foregoes the strong reliance on accurate learned dynamics models and instead uses an ensemble of learned dynamics models to create a policy that can quickly adapt to any model in the ensemble with one policy gradient step.

Model-Augmented Actor-Critic: Backpropagating through Paths

This paper builds a policy optimization algorithm that uses the pathwise derivative of the learned model and policy across future timesteps; it matches the asymptotic performance of model-free algorithms and scales to long horizons, a regime where past model-based approaches have typically struggled.

Deep Reinforcement Learning in a Handful of Trials using Probabilistic Dynamics Models

This paper proposes a new algorithm called probabilistic ensembles with trajectory sampling (PETS) that combines uncertainty-aware deep network dynamics models with sampling-based uncertainty propagation, which matches the asymptotic performance of model-free algorithms on several challenging benchmark tasks, while requiring significantly fewer samples.
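The core PETS mechanism, trajectory sampling over a model ensemble, can be sketched in a few lines. This is an illustrative toy, not the authors' implementation: the "ensemble members" here are hypothetical stand-in stochastic linear models rather than learned probabilistic networks, and planning uses simple random shooting rather than CEM.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_member(seed):
    """Hypothetical stand-in ensemble member: a noisy linear model
    s' = A s + B a + noise (in PETS this is a learned probabilistic NN)."""
    r = np.random.default_rng(seed)
    A = np.eye(2) + 0.05 * r.standard_normal((2, 2))
    B = 0.1 * r.standard_normal((2, 1))
    return lambda s, a: A @ s + B @ np.atleast_1d(a) + 0.01 * r.standard_normal(2)

ensemble = [make_member(i) for i in range(5)]

def ts_rollout(s0, actions, n_particles=20):
    """Trajectory-sampling uncertainty propagation: each particle
    re-samples which ensemble member to follow at every step."""
    returns = np.zeros(n_particles)
    for p in range(n_particles):
        s = s0.copy()
        for a in actions:
            model = ensemble[rng.integers(len(ensemble))]
            s = model(s, a)
            returns[p] += -np.sum(s ** 2)  # toy reward: stay near the origin
    return returns.mean()

def plan(s0, horizon=5, n_candidates=64):
    """Random-shooting MPC: score candidate action sequences by TS rollouts."""
    best_seq, best_val = None, -np.inf
    for _ in range(n_candidates):
        seq = rng.uniform(-1, 1, size=horizon)
        val = ts_rollout(s0, seq)
        if val > best_val:
            best_seq, best_val = seq, val
    return best_seq

first_action = plan(np.array([1.0, -1.0]))[0]
```

Re-sampling the model per particle per step is what propagates both aleatoric noise and ensemble disagreement into the planner's return estimates.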

Model-Ensemble Trust-Region Policy Optimization

This paper analyzes the behavior of vanilla model-based reinforcement learning methods when deep neural networks are used to learn both the model and the policy, and shows that the learned policy tends to exploit regions where insufficient data is available for the model to be learned, causing instability in training.

COMBO: Conservative Offline Model-Based Policy Optimization

A new model-based offline RL algorithm, COMBO, is developed that trains a value function using both the offline dataset and data generated via model rollouts, while additionally regularizing the value function on out-of-support state-action tuples generated by those rollouts, without requiring explicit uncertainty estimation.

Algorithmic Framework for Model-based Reinforcement Learning with Theoretical Guarantees

This work introduces a novel algorithmic framework for designing and analyzing model-based RL algorithms with theoretical guarantees, and designs a meta-algorithm with a provable guarantee of monotone improvement to a local maximum of the expected reward.

Conservative Q-Learning for Offline Reinforcement Learning

Conservative Q-learning (CQL) is proposed, which aims to address limitations of offline RL methods by learning a conservative Q-function such that the expected value of a policy under this Q-function lower-bounds its true value.
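The conservative regularizer at the heart of CQL can be illustrated in a tabular toy. This is a minimal sketch under stated assumptions, not the authors' implementation: the MDP, dataset, and hyperparameters below are all made up, and the penalty is applied with plain gradient steps on a Q-table.

```python
import numpy as np

# Toy tabular setup: 4 states, 3 actions, a tiny fixed offline dataset
# of (state, action, reward, next_state) transitions (all hypothetical).
n_states, n_actions = 4, 3
Q = np.zeros((n_states, n_actions))
dataset = [(0, 1, 1.0, 1), (1, 0, 0.0, 2), (2, 2, 1.0, 3), (3, 1, 0.0, 0)]
gamma, lr, alpha = 0.9, 0.1, 1.0  # alpha scales the conservative penalty

for _ in range(200):
    for s, a, r, s2 in dataset:
        # Standard Bellman backup on dataset transitions.
        td_target = r + gamma * Q[s2].max()
        Q[s, a] += lr * (td_target - Q[s, a])
        # CQL penalty: gradient of (logsumexp_a' Q[s, a'] - Q[s, a]).
        # This pushes Q down on all actions (weighted by softmax) and
        # back up on the dataset action, so out-of-distribution actions
        # end up with pessimistic values.
        softmax = np.exp(Q[s]) / np.exp(Q[s]).sum()
        Q[s] -= lr * alpha * softmax
        Q[s, a] += lr * alpha
```

After training, actions never seen in the dataset at a given state carry lower (here, negative) Q-values than the dataset action, which is exactly the lower-bounding behavior the abstract describes.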

When to Trust Your Model: Model-Based Policy Optimization

This paper first formulates and analyzes a model-based reinforcement learning algorithm with a guarantee of monotonic improvement at each step, and demonstrates that a simple procedure of using short model-generated rollouts branched from real data has the benefits of more complicated model-based algorithms without the usual pitfalls.
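The branched-rollout procedure can be sketched as follows. This is an illustrative toy, not MBPO's code: the dynamics model, policy, and buffers are hypothetical stand-ins, and only the data-generation step is shown (the actor-critic would then train on the union of both buffers).

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in real replay buffer: states collected from the true environment.
real_buffer = [np.array([x, -x]) for x in np.linspace(-1, 1, 8)]
model_buffer = []

def learned_model(s, a):
    """Hypothetical learned dynamics model (a fixed toy map here)."""
    return 0.95 * s + 0.1 * a

def policy(s):
    """Hypothetical current policy."""
    return -np.tanh(s)

k = 3  # short rollout horizon: keeps compounding model error bounded
for _ in range(10):
    # Branch each rollout from a *real* state, not from a model state.
    s = real_buffer[rng.integers(len(real_buffer))]
    for _ in range(k):
        a = policy(s)
        s2 = learned_model(s, a)
        model_buffer.append((s, a, s2))
        s = s2
# The actor-critic is then trained on real_buffer plus model_buffer.
```

Keeping k small and always branching from real states is what lets model-generated data boost sample efficiency without letting model error compound over long imagined trajectories.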