Corpus ID: 218900501

MOPO: Model-based Offline Policy Optimization

@article{Yu2020MOPOMO,
  title={MOPO: Model-based Offline Policy Optimization},
  author={Tianhe Yu and Garrett Thomas and Lantao Yu and Stefano Ermon and James Y. Zou and Sergey Levine and Chelsea Finn and Tengyu Ma},
  journal={ArXiv},
  year={2020},
  volume={abs/2005.13239}
}
Offline reinforcement learning (RL) refers to the problem of learning policies entirely from a batch of previously collected data. This problem setting is compelling, because it offers the promise of utilizing large, diverse, previously collected datasets to acquire policies without any costly or dangerous active exploration, but it is also exceptionally difficult, due to the distributional shift between the offline training data and the learned policy. While there has been significant progress…
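To make the problem setting concrete, below is a minimal, hypothetical sketch (not taken from the paper): tabular fitted Q-iteration run on a fixed batch of transitions. Because no new data can be collected, the values of state-action pairs that the batch never covers are never corrected, which is the distributional-shift failure mode the abstract refers to.

```python
# Illustrative sketch only (not from the MOPO paper): tabular fitted
# Q-iteration on a fixed dataset, showing how the greedy policy can rely on
# state-action pairs the batch never covers (distributional shift).
import numpy as np

n_states, n_actions, gamma = 4, 2, 0.99

# Fixed batch of transitions (s, a, r, s'); action 1 is never logged.
batch = [(0, 0, 0.0, 1), (1, 0, 0.0, 2), (2, 0, 1.0, 3), (3, 0, 0.0, 3)]

Q = np.zeros((n_states, n_actions))
for _ in range(200):                      # offline: no new environment samples
    for s, a, r, s_next in batch:
        Q[s, a] = r + gamma * Q[s_next].max()

greedy = Q.argmax(axis=1)
print("greedy actions:", greedy)
# Q[s, 1] never moves from its initialization (0 here) because action 1 is
# out-of-distribution; with optimistic initialization or function
# approximation, such unsupported values can dominate the argmax.
```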


Representation Balancing Offline Model-based Reinforcement Learning
This paper addresses the curse of horizon exhibited by RepBM, which rejects most of the pre-collected data in long-term tasks, and presents a new objective for model learning, motivated by recent advances in the estimation of stationary distribution corrections, that effectively overcomes this limitation of RepBM.
NeoRL: A Near Real-World Benchmark for Offline Reinforcement Learning
The paper presents a near-real-world offline RL benchmark named NeoRL, which contains datasets from various domains with controlled sizes along with extra test datasets for policy validation, and argues that the performance of a policy should also be compared with the deterministic version of the behavior policy rather than with the dataset reward.
COMBO: Conservative Offline Model-Based Policy Optimization
This work develops a new model-based offline RL algorithm, COMBO, that regularizes the value function on out-of-support state-action tuples generated via rollouts under the learned model, and finds that it consistently performs as well as or better than prior offline model-free and model-based methods on widely studied offline RL benchmarks, including image-based tasks.
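A hedged sketch of the kind of conservative critic update the summary describes: a standard TD loss on dataset transitions plus a term that lowers Q-values on model-generated state-action pairs and raises them on dataset pairs. The helper name, batch layout, and the discrete-action simplification are assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def combo_style_critic_loss(q_net, target_q_net, data_batch, model_batch,
                            gamma=0.99, beta=1.0):
    """Illustrative COMBO-style critic loss (hypothetical helper): TD error on
    dataset transitions plus a conservatism term that penalizes Q-values on
    model-generated (s, a) relative to dataset (s, a)."""
    s, a, r, s_next, done = data_batch          # tensors from the offline dataset
    s_m, a_m = model_batch                      # (s, a) sampled from model rollouts

    with torch.no_grad():
        target = r + gamma * (1.0 - done) * target_q_net(s_next).max(dim=1).values
    q_data = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
    td_loss = F.mse_loss(q_data, target)

    q_model = q_net(s_m).gather(1, a_m.long().unsqueeze(1)).squeeze(1)
    conservatism = q_model.mean() - q_data.mean()   # push down out-of-support optimism
    return td_loss + beta * conservatism
```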
Near Real-World Benchmarks for Offline Reinforcement Learning
Offline reinforcement learning (RL) aims at learning an optimal policy from a batch of collected data, without extra interactions with the environment during training. Offline RL attempts to…
Deployment-Efficient Reinforcement Learning via Model-Based Offline Optimization
A novel model-based algorithm, Behavior-Regularized Model-ENsemble (BREMEN), is proposed; it can effectively optimize a policy offline using 10-20 times less data than prior works and achieves impressive deployment efficiency while maintaining the same or better sample efficiency.
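As an illustration of the model-ensemble ingredient mentioned above (a sketch under assumed interfaces, not the BREMEN code), short imagined rollouts can be generated by picking a random ensemble member at every step, so that disagreement between models shows up as variation in the synthetic data:

```python
import numpy as np

def ensemble_rollout(models, policy, start_states, horizon=5, rng=None):
    """Roll a policy through an ensemble of learned dynamics models.
    `models` is a list of callables (s, a) -> (s_next, r) and `policy` is a
    callable s -> a; both are assumed placeholders for learned components."""
    rng = rng or np.random.default_rng(0)
    trajectories = []
    for s in start_states:
        traj = []
        for _ in range(horizon):
            a = policy(s)
            model = models[rng.integers(len(models))]  # random ensemble member
            s_next, r = model(s, a)                    # learned dynamics + reward
            traj.append((s, a, r, s_next))
            s = s_next
        trajectories.append(traj)
    return trajectories
```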
Representation Matters: Offline Pretraining for Sequential Decision Making
Through a variety of experiments utilizing standard offline RL datasets, it is found that the use of pretraining with unsupervised learning objectives can dramatically improve the performance of policy learning algorithms that otherwise yield mediocre performance on their own.
Reducing Conservativeness Oriented Offline Reinforcement Learning
This paper proposes reducing-conservativeness-oriented reinforcement learning, which is able to tackle the skewed distribution of the provided dataset and derive a value function closer to the expected value function.
DeepAveragers: Offline Reinforcement Learning by Solving Derived Non-Parametric MDPs
This work introduces the Deep Averagers with Costs MDP (DAC-MDP), a non-parametric model that can leverage deep representations and accounts for limited data by introducing costs for exploiting under-represented parts of the model, and investigates its solutions for offline RL.
Risk-Averse Offline Reinforcement Learning
The Offline Risk-Averse Actor-Critic (O-RAAC), a model-free RL algorithm that is able to learn risk-averse policies in a fully offline setting, is presented, and it is demonstrated empirically that in the presence of natural distribution shifts, O-RAAC learns policies with good average performance.
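As a generic illustration of what "risk-averse" usually means in this line of work (an assumption, not O-RAAC's exact objective), the agent optimizes a tail statistic such as the conditional value-at-risk (CVaR) of the return distribution rather than its mean:

```python
import numpy as np

def empirical_cvar(returns, alpha=0.1):
    """Conditional value-at-risk: the mean of the worst alpha-fraction of
    sampled returns. Generic illustration, not the O-RAAC objective itself."""
    returns = np.sort(np.asarray(returns))
    k = max(1, int(np.ceil(alpha * len(returns))))
    return returns[:k].mean()

returns = np.random.default_rng(0).normal(loc=1.0, scale=2.0, size=1000)
print("mean:", returns.mean(), "CVaR(0.1):", empirical_cvar(returns, 0.1))
```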
Regularized Behavior Value Estimation
This work introduces Regularized Behavior Value Estimation (R-BVE), which estimates the value of the behavior policy during training, performs policy improvement only at deployment time, and uses a ranking regularisation term that favours actions in the dataset that lead to successful outcomes.
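A minimal tabular sketch of the split described above, evaluating the behavior policy during training and taking a single greedy improvement step only at deployment; the ranking regularisation term is omitted and none of this is the authors' code:

```python
import numpy as np

# SARSA-style evaluation of the behavior policy on a fixed batch of
# (s, a, r, s', a') tuples, where a' is the action actually logged next.
n_states, n_actions, gamma = 3, 2, 0.9
batch = [(0, 0, 0.0, 1, 0), (1, 0, 1.0, 2, 1), (0, 1, 0.5, 2, 1)]

Q_beta = np.zeros((n_states, n_actions))
for _ in range(100):
    for s, a, r, s_next, a_next in batch:
        Q_beta[s, a] = r + gamma * Q_beta[s_next, a_next]   # no max: pure evaluation

deploy_policy = Q_beta.argmax(axis=1)  # one-step improvement, applied only at test time
print(deploy_policy)
```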

References

Showing 1-10 of 77 references.
Keep Doing What Worked: Behavioral Modelling Priors for Offline Reinforcement Learning
This paper admits data generated by arbitrary behavior policies and uses a learned prior, the advantage-weighted behavior model (ABM), to bias the RL policy towards actions that have previously been executed and are likely to be successful on the new task.
D4RL: Datasets for Deep Data-Driven Reinforcement Learning
This work introduces benchmarks specifically designed for the offline setting, guided by key properties of datasets relevant to real-world applications of offline RL, and releases benchmark tasks and datasets with a comprehensive evaluation of existing algorithms and an evaluation protocol, together with an open-source codebase.
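For reference, D4RL datasets are typically accessed as in the sketch below (based on the project's documented interface; the exact environment id and version suffix depend on the installed release):

```python
import gym
import d4rl  # registers the offline environments with gym on import

env = gym.make("halfcheetah-medium-v2")       # assumed dataset id / version
data = d4rl.qlearning_dataset(env)            # dict of numpy arrays
print(data["observations"].shape, data["actions"].shape,
      data["rewards"].shape, data["terminals"].shape)
```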
Way Off-Policy Batch Deep Reinforcement Learning of Implicit Human Preferences in Dialog
This work develops a novel class of off-policy batch RL algorithms that can effectively learn offline, without exploring, from a fixed batch of human interaction data, using models pre-trained on that data as a strong prior and KL-control to penalize divergence from this prior during RL training.
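A hedged sketch of the KL-control idea in the summary: the per-step reward is reduced by the divergence between the learned policy and a prior pre-trained on the batch. Function and argument names are placeholders, and discrete action logits are assumed.

```python
import torch
import torch.nn.functional as F

def kl_shaped_reward(reward, policy_logits, prior_logits, alpha=0.1):
    """Subtract KL(pi || prior) from the reward so the agent is penalized for
    drifting away from behaviour seen in the data (generic sketch)."""
    log_pi = F.log_softmax(policy_logits, dim=-1)
    log_prior = F.log_softmax(prior_logits, dim=-1)
    kl = (log_pi.exp() * (log_pi - log_prior)).sum(dim=-1)
    return reward - alpha * kl
```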
Behavior Regularized Offline Reinforcement Learning
A general framework, behavior regularized actor critic (BRAC), is introduced to empirically evaluate recently proposed methods as well as a number of simple baselines across a variety of offline continuous control tasks.
Model-Ensemble Trust-Region Policy Optimization
This paper analyzes the behavior of vanilla model-based reinforcement learning methods when deep neural networks are used to learn both the model and the policy, and shows that the learned policy tends to exploit regions where insufficient data is available for the model to be learned, causing instability in training.
Striving for Simplicity in Off-policy Deep Reinforcement Learning
A simple and novel variant of ensemble Q-learning called Random Ensemble Mixture (REM), which enforces optimal Bellman consistency on random convex combinations of the Q-heads of a multi-head Q-network, is presented.
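The mechanism in the summary maps directly onto a small loss function; the sketch below assumes a multi-head Q-network returning a (batch, heads, actions) tensor and is illustrative rather than the paper's implementation.

```python
import torch
import torch.nn.functional as F

def rem_loss(q_heads, target_q_heads, batch, gamma=0.99):
    """Q-learning loss on a random convex combination of the Q-heads."""
    s, a, r, s_next, done = batch
    k = q_heads(s).shape[1]                                     # number of heads
    w = torch.distributions.Dirichlet(torch.ones(k)).sample()   # convex weights

    q_mix = (q_heads(s) * w.view(1, -1, 1)).sum(dim=1)          # (batch, actions)
    q_sa = q_mix.gather(1, a.long().unsqueeze(1)).squeeze(1)

    with torch.no_grad():
        target_mix = (target_q_heads(s_next) * w.view(1, -1, 1)).sum(dim=1)
        target = r + gamma * (1.0 - done) * target_mix.max(dim=1).values
    return F.smooth_l1_loss(q_sa, target)
```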
Off-Policy Deep Reinforcement Learning without Exploration
This paper introduces a novel class of off-policy algorithms, batch-constrained reinforcement learning, which restricts the action space in order to force the agent towards behaving close to on-policy with respect to a subset of the given data.
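A discrete-action sketch of the batch-constrained idea (not the authors' BCQ code): the greedy action is chosen only among actions that a learned behavior model considers sufficiently likely under the data.

```python
import torch

def batch_constrained_action(q_values, behavior_log_probs, threshold=0.3):
    """Only actions whose estimated behavior probability is within `threshold`
    of the most likely logged action are eligible for the argmax."""
    probs = behavior_log_probs.exp()
    eligible = probs / probs.max(dim=-1, keepdim=True).values >= threshold
    masked_q = torch.where(eligible, q_values, torch.full_like(q_values, -1e9))
    return masked_q.argmax(dim=-1)
```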
Model-Based Reinforcement Learning via Meta-Policy Optimization
This work proposes Model-Based Meta-Policy-Optimization (MB-MPO), an approach that foregoes the strong reliance on accurate learned dynamics models and instead uses an ensemble of learned dynamics models to create a policy that can quickly adapt to any model in the ensemble with one policy gradient step.
AlgaeDICE: Policy Gradient from Arbitrary Experience
A new formulation of max-return optimization that allows the problem to be re-expressed as an expectation over an arbitrary behavior-agnostic, off-policy data distribution; it is shown that, if the auxiliary dual variables of the objective are optimized, then the gradient of the off-policy objective is exactly the on-policy policy gradient, without any use of importance weighting.
Stabilizing Off-Policy Q-Learning via Bootstrapping Error Reduction
A practical algorithm, bootstrapping error accumulation reduction (BEAR), is proposed, and it is demonstrated that BEAR is able to learn robustly from different off-policy distributions, including random and suboptimal demonstrations, on a range of continuous control tasks.
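In practice, methods in this family constrain the learned policy to the support of the dataset; one common instantiation (an assumption here, since the summary does not spell it out) is a sampled maximum mean discrepancy penalty between policy actions and dataset actions:

```python
import torch

def mmd_gaussian(x, y, sigma=10.0):
    """Sampled maximum mean discrepancy with a Gaussian kernel between policy
    actions `x` and dataset actions `y`, each of shape (n, action_dim).
    A support-matching sketch, not the paper's exact constraint."""
    def kernel(a, b):
        d = ((a.unsqueeze(1) - b.unsqueeze(0)) ** 2).sum(-1)
        return torch.exp(-d / (2 * sigma ** 2))
    return kernel(x, x).mean() + kernel(y, y).mean() - 2 * kernel(x, y).mean()
```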