Efficient Offline Policy Optimization with a Learned Model

Zi-Yan Liu, Siyi Li, Wee Sun Lee, Shuicheng Yan, Zhongwen Xu

MuZero Unplugged presents a promising approach for offline policy learning from logged data. It conducts Monte-Carlo Tree Search (MCTS) with a learned model and leverages the Reanalyze algorithm to learn purely from offline data. For good performance, MCTS requires accurate learned models and a large number of simulations, which makes it computationally expensive. This paper investigates a few hypotheses about where MuZero Unplugged may not work well in offline RL settings, including 1) learning with limited…

Online and Offline Reinforcement Learning by Planning with a Learned Model

The Reanalyse algorithm is described, which uses model-based policy and value improvement operators to compute new, improved training targets on existing data points, allowing efficient learning for data budgets varying by several orders of magnitude.

Conservative Q-Learning for Offline Reinforcement Learning

Conservative Q-learning (CQL) is proposed, which aims to address limitations of offline RL methods by learning a conservative Q-function such that the expected value of a policy under this Q-function lower-bounds its true value.
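
As an illustration of how a Q-function can be made conservative, the following is a minimal sketch of one commonly used form of the CQL regularizer for discrete actions: the log-sum-exp of Q over all actions minus Q at the logged action. The function name `cql_penalty` and its arguments are hypothetical, and the Bellman error term that a full loss would add is omitted.

```python
import numpy as np

def cql_penalty(q_values, data_actions):
    """Mean over the batch of logsumexp_a Q(s, a) - Q(s, a_data).

    q_values: (batch, num_actions) array, assumed to come from some Q-network.
    data_actions: (batch,) integer array of actions logged in the dataset.
    """
    q_values = np.asarray(q_values, dtype=np.float64)
    data_actions = np.asarray(data_actions)
    # Numerically stable log-sum-exp over the action axis.
    q_max = q_values.max(axis=1, keepdims=True)
    lse = q_max[:, 0] + np.log(np.exp(q_values - q_max).sum(axis=1))
    # Q-value of the action that actually appears in the dataset.
    q_data = q_values[np.arange(len(data_actions)), data_actions]
    return float(np.mean(lse - q_data))
```

Minimizing this term pushes Q down on actions that are not supported by the data and up on logged actions, which is the intuition behind the lower-bound property stated above.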

The Arcade Learning Environment: An Evaluation Platform for General Agents (Extended Abstract)

The promise of ALE is illustrated by developing and benchmarking domain-independent agents designed using well-established AI techniques for both reinforcement learning and planning, and an evaluation methodology made possible by ALE is proposed.

Critic Regularized Regression

Caglar Gulcehre, Nicolas Heess, et al. NeurIPS, 2020.

On the role of planning in model-based deep reinforcement learning

This paper studies the performance of MuZero, a state-of-the-art model-based reinforcement learning (MBRL) algorithm that shares strong connections and overlapping components with many other MBRL algorithms, and suggests that planning alone is insufficient to drive strong generalization.

Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model

The MuZero algorithm is presented, which, by combining a tree-based search with a learned model, achieves superhuman performance in a range of challenging and visually complex domains, without any knowledge of their underlying dynamics.

Regularized Behavior Value Estimation

This work introduces Regularized Behavior Value Estimation (R-BVE), which estimates the value of the behavior policy during training and only performs policy improvement at deployment time, and uses a ranking regularisation term that favours actions in the dataset that lead to successful outcomes.

Monte-Carlo Tree Search as Regularized Policy Optimization

This paper proposes a variant of AlphaZero that uses the exact solution to the regularized policy optimization problem that MCTS approximately solves, and shows experimentally that it reliably outperforms the original algorithm in multiple domains.
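
To illustrate what an exact solution to a regularized policy optimization problem can look like, here is a minimal sketch that assumes the objective is to maximize `q · y - lam * KL(prior, y)` over the probability simplex. Under that assumption the maximizer has the form `y[a] = lam * prior[a] / (alpha - q[a])`, with the scalar `alpha` chosen so that `y` sums to one. The function name, the fixed multiplier `lam`, and the use of bisection are illustrative choices, not details taken from the summary above.

```python
import numpy as np

def regularized_policy(q, prior, lam, iters=100):
    """Solve max_y q.y - lam * KL(prior, y) over the simplex by bisection on alpha.

    Assumes lam > 0 and a strictly positive prior (hypothetical inputs).
    """
    q = np.asarray(q, dtype=np.float64)
    prior = np.asarray(prior, dtype=np.float64)
    # Bracket for alpha: the candidate policy sums to >= 1 at lo and <= 1 at hi.
    lo = np.max(q + lam * prior)
    hi = np.max(q) + lam
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        total = np.sum(lam * prior / (mid - q))
        if total > 1.0:
            lo = mid  # mass too large, so alpha must grow
        else:
            hi = mid  # mass too small, so alpha must shrink
    alpha = 0.5 * (lo + hi)
    pi = lam * prior / (alpha - q)
    return pi / pi.sum()  # remove residual numerical error
```

When all entries of `q` are equal, the result reduces to the prior; as `q` favors an action, probability mass shifts toward it while staying anchored to the prior.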

Observe and Look Further: Achieving Consistent Performance on Atari

This paper proposes an algorithm that addresses three key challenges that any algorithm needs to master in order to perform well on all games: processing diverse reward distributions, reasoning over long time horizons, and exploring efficiently.

Trust Region Policy Optimization

An iterative method for optimizing control policies with guaranteed monotonic improvement is described; making several approximations to this theoretically justified scheme yields a practical algorithm called Trust Region Policy Optimization (TRPO).