# MOPO: Model-based Offline Policy Optimization

@article{Yu2020MOPOMO, title={MOPO: Model-based Offline Policy Optimization}, author={Tianhe Yu and Garrett Thomas and Lantao Yu and Stefano Ermon and James Y. Zou and Sergey Levine and Chelsea Finn and Tengyu Ma}, journal={ArXiv}, year={2020}, volume={abs/2005.13239} }

Offline reinforcement learning (RL) refers to the problem of learning policies entirely from a batch of previously collected data. This problem setting is compelling because it offers the promise of utilizing large, diverse, previously collected datasets to acquire policies without any costly or dangerous active exploration, but it is also exceptionally difficult, due to the distributional shift between the offline training data and the learned policy. While there has been significant progress…
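MOPO's central mechanism is to run policy optimization inside a learned dynamics model while penalizing the model's reward by an estimate of its own uncertainty, so the policy is discouraged from exploiting regions where the model is unreliable. A minimal sketch of that penalized reward, assuming a Gaussian dynamics ensemble whose members predict per-dimension standard deviations (the names and toy numbers here are illustrative, not the paper's implementation):

```python
import numpy as np

def mopo_reward(model_reward, ensemble_stds, lam=1.0):
    """Uncertainty-penalized reward: r~(s, a) = r(s, a) - lam * u(s, a).

    As the uncertainty heuristic u(s, a), MOPO uses the largest norm of the
    predicted next-state standard deviation across ensemble members.
    """
    u = max(np.linalg.norm(std) for std in ensemble_stds)
    return model_reward - lam * u

# Example: three ensemble members predicting next-state std vectors
# for a single (s, a) pair in a 2-D toy state space.
stds = [np.array([0.1, 0.2]), np.array([0.05, 0.1]), np.array([0.3, 0.4])]
penalized = mopo_reward(model_reward=1.0, ensemble_stds=stds, lam=1.0)
```

The penalty coefficient `lam` trades off return maximization against staying close to the data's support; larger values make the policy more conservative.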

#### 106 Citations

Representation Balancing Offline Model-based Reinforcement Learning

- Computer Science
- ICLR
- 2021

This paper addresses the curse of horizon exhibited by RepBM, which rejects most of the pre-collected data in long-term tasks, and presents a new objective for model learning motivated by recent advances in the estimation of stationary distribution corrections, effectively overcoming this limitation of RepBM.

NeoRL: A Near Real-World Benchmark for Offline Reinforcement Learning

- Computer Science
- ArXiv
- 2021

A near-real-world offline RL benchmark, named NeoRL, is presented, which contains datasets from various domains with controlled sizes, plus extra test datasets for policy validation; it is argued that a policy's performance should also be compared with the deterministic version of the behavior policy, rather than the dataset reward.

COMBO: Conservative Offline Model-Based Policy Optimization

- Computer Science
- ArXiv
- 2021

This work develops a new model-based offline RL algorithm, COMBO, that regularizes the value function on out-of-support state-action tuples generated via rollouts under the learned model, and finds that it consistently performs as well as or better than prior offline model-free and model-based methods on widely studied offline RL benchmarks, including image-based tasks.

Near Real-World Benchmarks for Offline Reinforcement Learning

- 2021

Offline reinforcement learning (RL) aims at learning an optimal policy from a batch of collected data, without extra interactions with the environment during training. Offline RL attempts to…

Deployment-Efficient Reinforcement Learning via Model-Based Offline Optimization

- Computer Science, Mathematics
- ICLR
- 2021

A novel model-based algorithm, Behavior-Regularized Model-ENsemble (BREMEN), can effectively optimize a policy offline using 10-20 times less data than prior work, and achieves impressive deployment efficiency while maintaining the same or better sample efficiency.

Representation Matters: Offline Pretraining for Sequential Decision Making

- Computer Science
- ICML
- 2021

Through a variety of experiments utilizing standard offline RL datasets, it is found that the use of pretraining with unsupervised learning objectives can dramatically improve the performance of policy learning algorithms that otherwise yield mediocre performance on their own.

Reducing Conservativeness Oriented Offline Reinforcement Learning

- Computer Science
- ArXiv
- 2021

This paper proposes a method for reducing conservativeness in offline reinforcement learning, which is able to tackle the skewed distribution of the provided dataset and derive a value function closer to the expected value function.

DeepAveragers: Offline Reinforcement Learning by Solving Derived Non-Parametric MDPs

- Computer Science, Mathematics
- ICLR
- 2021

This work introduces the Deep Averagers with Costs MDP (DAC-MDP), a non-parametric model that can leverage deep representations and account for limited data by introducing costs for exploiting under-represented parts of the model, and investigates its solutions for offline RL.

Risk-Averse Offline Reinforcement Learning

- Computer Science
- ICLR
- 2021

The Offline Risk-Averse Actor-Critic (O-RAAC), a model-free RL algorithm that is able to learn risk-averse policies in a fully offline setting, is presented, and it is demonstrated empirically that in the presence of natural distribution shifts, O-RAAC learns policies with good average performance.

Regularized Behavior Value Estimation

- Computer Science
- ArXiv
- 2021

This work introduces Regularized Behavior Value Estimation (R-BVE), which estimates the value of the behavior policy during training and only performs policy improvement at deployment time, and uses a ranking regularisation term that favours actions in the dataset that lead to successful outcomes.

#### References

Showing 1-10 of 77 references

Keep Doing What Worked: Behavioral Modelling Priors for Offline Reinforcement Learning

- Computer Science, Mathematics
- ICLR
- 2020

This paper admits the use of data generated by arbitrary behavior policies and uses a learned prior -- the advantage-weighted behavior model (ABM) -- to bias the RL policy towards actions that have previously been executed and are likely to be successful on the new task.

D4RL: Datasets for Deep Data-Driven Reinforcement Learning

- Computer Science, Mathematics
- ArXiv
- 2020

This work introduces benchmarks specifically designed for the offline setting, guided by key properties of datasets relevant to real-world applications of offline RL, and releases benchmark tasks and datasets with a comprehensive evaluation of existing algorithms and an evaluation protocol together with an open-source codebase.

Way Off-Policy Batch Deep Reinforcement Learning of Implicit Human Preferences in Dialog

- Computer Science, Mathematics
- ArXiv
- 2019

This work develops a novel class of off-policy batch RL algorithms, able to effectively learn offline, without exploring, from a fixed batch of human interaction data, using models pre-trained on data as a strong prior, and uses KL-control to penalize divergence from this prior during RL training.

Behavior Regularized Offline Reinforcement Learning

- Computer Science, Mathematics
- ArXiv
- 2019

A general framework, behavior regularized actor critic (BRAC), is introduced to empirically evaluate recently proposed methods as well as a number of simple baselines across a variety of offline continuous control tasks.

Model-Ensemble Trust-Region Policy Optimization

- Computer Science, Mathematics
- ICLR
- 2018

This paper analyzes the behavior of vanilla model-based reinforcement learning methods when deep neural networks are used to learn both the model and the policy, and shows that the learned policy tends to exploit regions where insufficient data is available for the model to be learned, causing instability in training.

Striving for Simplicity in Off-policy Deep Reinforcement Learning

- Computer Science, Mathematics
- ArXiv
- 2019

A simple and novel variant of ensemble Q-learning called Random Ensemble Mixture (REM), which enforces optimal Bellman consistency on random convex combinations of the Q-heads of a multi-head Q-network, is presented.

Off-Policy Deep Reinforcement Learning without Exploration

- Computer Science, Mathematics
- ICML
- 2019

This paper introduces a novel class of off-policy algorithms, batch-constrained reinforcement learning, which restricts the action space in order to force the agent towards behaving close to on-policy with respect to a subset of the given data.

Model-Based Reinforcement Learning via Meta-Policy Optimization

- Computer Science, Mathematics
- CoRL
- 2018

This work proposes Model-Based Meta-Policy-Optimization (MB-MPO), an approach that foregoes the strong reliance on accurate learned dynamics models and instead uses an ensemble of learned dynamics models to create a policy that can quickly adapt to any model in the ensemble with one policy gradient step.

AlgaeDICE: Policy Gradient from Arbitrary Experience

- Computer Science
- ArXiv
- 2019

A new formulation of max-return optimization that allows the problem to be re-expressed as an expectation over an arbitrary behavior-agnostic and off-policy data distribution, and shows that, if auxiliary dual variables of the objective are optimized, then the gradient of the off-policy objective is exactly the on-policy policy gradient, without any use of importance weighting.

Stabilizing Off-Policy Q-Learning via Bootstrapping Error Reduction

- Computer Science, Mathematics
- NeurIPS
- 2019

A practical algorithm, bootstrapping error accumulation reduction (BEAR), is proposed and it is demonstrated that BEAR is able to learn robustly from different off-policy distributions, including random and suboptimal demonstrations, on a range of continuous control tasks.