Corpus ID: 225103394

Low-Variance Policy Gradient Estimation with World Models

@article{Nauman2020LowVariancePG,
  title={Low-Variance Policy Gradient Estimation with World Models},
  author={Michal Nauman and Floris den Hengst},
  journal={ArXiv},
  year={2020},
  volume={abs/2010.15622}
}
In this paper, we propose World Model Policy Gradient (WMPG), an approach to reduce the variance of policy gradient estimates using learned world models (WMs). In WMPG, a WM is trained online and used to imagine trajectories. The imagined trajectories are used in two ways: first, to compute a without-replacement estimator of the policy gradient; second, the return of the imagined trajectories is used as an informed baseline. We compare the proposed approach with AC and MAC on a set of…
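
A minimal sketch of the imagined-rollout idea described above, assuming a tiny tabular MDP, an already learned world model, and a softmax policy; all names here are hypothetical, and the paper's without-replacement estimator is simplified to ordinary with-replacement imagined rollouts, so this only illustrates using the mean imagined return as an informed baseline:

# Sketch only (hypothetical names, not the authors' code): imagined rollouts
# inside a learned world model, with their mean return used as a baseline
# for a score-function (REINFORCE-style) policy gradient estimate.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, horizon, gamma = 4, 2, 10, 0.95

# "Learned" world model: transition probabilities and rewards (assumed given).
P_hat = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
R_hat = rng.normal(size=(n_states, n_actions))

theta = np.zeros((n_states, n_actions))          # softmax policy logits

def pi(s):
    z = np.exp(theta[s] - theta[s].max())
    return z / z.sum()

def imagine(s0):
    """Roll out one trajectory inside the world model; return its return and score-function gradient."""
    s, ret, grad = s0, 0.0, np.zeros_like(theta)
    for t in range(horizon):
        p = pi(s)
        a = rng.choice(n_actions, p=p)
        grad[s] -= p
        grad[s, a] += 1.0                        # d log pi(a|s) / d theta for the softmax policy
        ret += gamma ** t * R_hat[s, a]
        s = rng.choice(n_states, p=P_hat[s, a])
    return ret, grad

def wmpg_step(s0, k=8):
    """Average k imagined rollouts; their mean return serves as an informed baseline."""
    rollouts = [imagine(s0) for _ in range(k)]
    baseline = np.mean([r for r, _ in rollouts])
    return sum((r - baseline) * dlogp for r, dlogp in rollouts) / k

for _ in range(100):
    theta += 0.1 * wmpg_step(s0=0)
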
1 Citation


Policy Improvement by Planning with Gumbel

Gumbel AlphaZero and Gumbel MuZero, respectively without and with model-learning, match the state of the art on Go, chess, and Atari, and significantly improve prior performance when planning with few simulations.
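
For context, the Gumbel-based planning in that work builds on the Gumbel-Top-k trick, which is also the standard device for sampling without replacement (stated here from general knowledge, not quoted from the paper): with $g_a \sim \mathrm{Gumbel}(0,1)$ drawn i.i.d. per action, the top-$k$ actions ranked by $\log \pi(a) + g_a$ are distributed as a size-$k$ sample without replacement from $\pi$; the $k=1$ case is the classic Gumbel-max trick.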

References

Showing 1-10 of 40 references

Policy Gradient Methods for Reinforcement Learning with Function Approximation

This paper proves for the first time that a version of policy iteration with arbitrary differentiable function approximation is convergent to a locally optimal policy.
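
For reference, the policy gradient theorem proved there is usually written as follows (standard statement, with $d^{\pi}$ the discounted state distribution):

$\nabla_\theta J(\theta) \;=\; \mathbb{E}_{s \sim d^{\pi},\, a \sim \pi_\theta}\!\left[\nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi}(s, a)\right]$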

Value Prediction Network

This paper proposes a novel deep reinforcement learning architecture, called Value Prediction Network (VPN), which integrates model-free and model-based RL methods into a single neural network, which outperforms Deep Q-Network on several Atari games even with short-lookahead planning.

Model-Based Value Estimation for Efficient Model-Free Reinforcement Learning

By enabling wider use of learned dynamics models within a model-free reinforcement learning algorithm, this work improves value estimation, which, in turn, reduces the sample complexity of learning.
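
The value-expansion target behind this idea is commonly written as follows, where $\hat{r}_t$ and $\hat{s}_H$ come from an $H$-step rollout of the learned dynamics model and $V_\phi$ is the learned value function (standard form, not the paper's exact notation):

$\hat{V}_H(s_0) \;=\; \sum_{t=0}^{H-1} \gamma^{t}\, \hat{r}_t \;+\; \gamma^{H}\, V_\phi(\hat{s}_H)$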

SOLAR: Deep Structured Representations for Model-Based Reinforcement Learning

This paper presents a method for learning representations that are suitable for iterative model-based policy improvement, even when the underlying dynamical system has complex dynamics and image observations, in that these representations are optimized for inferring simple dynamics and cost models given data from the current policy.

Deep Variational Reinforcement Learning for POMDPs

Deep variational reinforcement learning (DVRL) is proposed, which introduces an inductive bias that allows an agent to learn a generative model of the environment and perform inference in that model to effectively aggregate the available information.

Trust Region Policy Optimization

A method for optimizing control policies, with guaranteed monotonic improvement, by making several approximations to the theoretically-justified scheme, called Trust Region Policy Optimization (TRPO).
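
The constrained problem that TRPO approximates is usually stated as (standard formulation, with $\delta$ the trust-region radius):

$\max_{\theta}\; \mathbb{E}\!\left[\frac{\pi_\theta(a \mid s)}{\pi_{\theta_{\text{old}}}(a \mid s)}\, A^{\pi_{\theta_{\text{old}}}}(s, a)\right] \quad \text{s.t.} \quad \mathbb{E}\!\left[D_{\mathrm{KL}}\!\big(\pi_{\theta_{\text{old}}}(\cdot \mid s)\,\big\|\,\pi_\theta(\cdot \mid s)\big)\right] \le \delta$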

Model-Based Reinforcement Learning for Atari

Simulated Policy Learning (SimPLe), a complete model-based deep RL algorithm based on video prediction models, is described and a comparison of several model architectures is presented, including a novel architecture that yields the best results in the authors' setting.

A Natural Policy Gradient

This work provides a natural gradient method that represents the steepest descent direction based on the underlying structure of the parameter space and shows drastic performance improvements in simple MDPs and in the more challenging MDP of Tetris.
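
The natural gradient direction referred to here is the Fisher-preconditioned gradient (standard definition):

$\tilde{\nabla}_\theta J(\theta) = F(\theta)^{-1} \nabla_\theta J(\theta), \qquad F(\theta) = \mathbb{E}\!\left[\nabla_\theta \log \pi_\theta(a \mid s)\, \nabla_\theta \log \pi_\theta(a \mid s)^{\top}\right]$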

Recurrent World Models Facilitate Policy Evolution

A generative recurrent neural network is quickly trained in an unsupervised manner to model popular reinforcement learning environments through compressed spatio-temporal representations. The world model's extracted features are fed into compact and simple policies trained by evolution, achieving state-of-the-art results in various environments.

Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation

This work proposes to apply trust region optimization to deep reinforcement learning using a recently proposed Kronecker-factored approximation to the curvature; it is the first scalable trust-region natural gradient method for actor-critic methods.
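
In the Kronecker-factored (K-FAC) approximation this relies on, the Fisher block of a layer with input activations $a$ and backpropagated output gradients $g$ is approximated by a Kronecker product, whose inverse factorizes and therefore remains cheap to apply per layer (standard form, stated from general knowledge):

$F_\ell \approx \mathbb{E}\!\left[a a^{\top}\right] \otimes \mathbb{E}\!\left[g g^{\top}\right], \qquad F_\ell^{-1} \approx \mathbb{E}\!\left[a a^{\top}\right]^{-1} \otimes \mathbb{E}\!\left[g g^{\top}\right]^{-1}$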