Corpus ID: 208192232

Policy Optimization by Local Improvement through Search

Jialin Song, Joe Wenjie Jiang, Amir Yazdanbakhsh, Ebrahim M. Songhori, Anna Goldie, Navdeep Jaitly, Azalia Mirhoseini
Imitation learning has emerged as a powerful strategy for learning initial policies that can be refined with reinforcement learning techniques. Most imitation learning strategies, however, rely on per-step supervision, either from expert demonstrations, referred to as behavioral cloning (Pomerleau, 1989; 1991), or from interactive expert policy queries, as in DAgger (Ross et al., 2011). These strategies differ in the state distribution at which the expert actions are collected – the former…


Efficient Reductions for Imitation Learning

This work proposes two alternative algorithms for imitation learning where training occurs over several episodes of interaction and shows that this leads to stronger performance guarantees and improved performance on two challenging problems: training a learner to play a 3D racing game and Mario Bros.

Truncated Horizon Policy Search: Combining Reinforcement Learning & Imitation Learning

This work bridges the gap between imitation learning and reinforcement learning, interpolating between the two, and proposes Truncated HORizon Policy Search (THOR), a method that searches for policies maximizing the total reshaped reward over a finite planning horizon when the oracle is sub-optimal.

Generative Adversarial Imitation Learning

A new general framework for directly extracting a policy from data, as if it were obtained by reinforcement learning following inverse reinforcement learning, is proposed and a certain instantiation of this framework draws an analogy between imitation learning and generative adversarial networks.

A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning

This paper proposes a new iterative algorithm, which trains a stationary deterministic policy and can be seen as a no-regret algorithm in an online learning setting, and demonstrates that this new approach outperforms previous approaches on two challenging imitation learning problems and a benchmark sequence labeling problem.
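The data-aggregation loop described above (roll out the current policy, label the visited states with expert actions, aggregate, refit) can be sketched on a toy 1-D control task; this is an illustrative sketch under assumed dynamics and a hypothetical linear expert, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def expert_action(s):
    # Hypothetical expert: drive the state toward zero.
    return -0.5 * s

def rollout(theta, s0=2.0, horizon=10):
    """Roll out the learner's linear policy a = theta * s and
    return the states it actually visits (with small dynamics noise)."""
    states, s = [], s0
    for _ in range(horizon):
        states.append(s)
        s = s + theta * s + 0.1 * rng.standard_normal()
    return np.array(states)

# DAgger-style loop: the expert labels states from the LEARNER's own
# state distribution, and the policy is refit on the aggregated data.
X, Y = [], []
theta = 0.0  # initial (untrained) linear policy
for _ in range(5):
    states = rollout(theta)
    X.extend(states)
    Y.extend(expert_action(states))
    x, y = np.array(X), np.array(Y)
    theta = float(x @ y / (x @ x))  # least-squares fit of a = theta * s
```

Because the expert here is exactly linear, the learner recovers it after the first aggregation step; the point of the loop is that the training states come from the learner's own rollouts rather than from expert trajectories.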

Model-Based Reinforcement Learning via Meta-Policy Optimization

This work proposes Model-Based Meta-Policy-Optimization (MB-MPO), an approach that foregoes the strong reliance on accurate learned dynamics models and instead uses an ensemble of learned dynamics models to create a policy that can quickly adapt to any model in the ensemble with one policy gradient step.

Dual Policy Iteration

This work studies this Dual Policy Iteration (DPI) strategy in an alternating optimization framework and provides a convergence analysis that extends existing API theory, and develops a special instance of this framework which reduces the update of non-reactive policies to model-based optimal control using learned local models.

Thinking Fast and Slow with Deep Learning and Tree Search

This paper presents Expert Iteration (ExIt), a novel reinforcement learning algorithm which decomposes the problem into separate planning and generalisation tasks, and shows that ExIt outperforms REINFORCE for training a neural network to play the board game Hex, and the final tree search agent, trained tabula rasa, defeats MoHex 1.0.

On the sample complexity of reinforcement learning.

Novel algorithms with more restricted guarantees are suggested, whose sample complexities are again independent of the size of the state space and depend linearly on the complexity of the policy class, but have only a polynomial dependence on the horizon time.

DART: Noise Injection for Robust Imitation Learning

A new algorithm is proposed, DART (Disturbances for Augmenting Robot Trajectories), that collects demonstrations with injected noise, and optimizes the noise level to approximate the error of the robot's trained policy during data collection.
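The idea of collecting demonstrations under injected noise, with the noise level tuned to approximate the learner's own error, can be sketched as follows; the expert controller, dynamics, and noise-update rule here are illustrative assumptions, not DART's exact optimization:

```python
import numpy as np

rng = np.random.default_rng(1)

def expert_action(s):
    # Hypothetical nonlinear expert driving the state toward zero;
    # a linear learner cannot match it exactly, so its error is nonzero.
    return -0.5 * s - 0.1 * s ** 3

def collect_demos(sigma, n=200):
    """Record (state, expert label) pairs while EXECUTING noisy expert
    actions, so the visited states resemble those of an imperfect learner."""
    X, Y, s = [], [], 2.0
    for _ in range(n):
        a = expert_action(s)
        X.append(s)
        Y.append(a)
        s = s + a + rng.normal(0.0, sigma)  # noise injected at execution time
        if abs(s) < 1e-3:
            s = 2.0  # restart once the episode has converged
    return np.array(X), np.array(Y)

sigma = 0.0  # start with noise-free demonstrations
theta = 0.0
for _ in range(3):
    X, Y = collect_demos(sigma)
    theta = float(X @ Y / (X @ X))        # fit linear policy a = theta * s
    sigma = float(np.std(Y - theta * X))  # set noise to the learner's
                                          # empirical error on the demos
```

Each round widens the demonstration distribution to cover the states the trained policy would drift into, which is the robustness mechanism the abstract describes.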

Proximal Policy Optimization Algorithms

We propose a new family of policy gradient methods for reinforcement learning, which alternate between sampling data through interaction with the environment, and optimizing a "surrogate" objective function using stochastic gradient ascent.
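The clipped form of that surrogate (the PPO-clip variant) can be sketched in a few lines; the toy ratios and advantages below are illustrative values, not outputs from any trained policy:

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Clipped surrogate: mean of min(r * A, clip(r, 1-eps, 1+eps) * A),
    where r is the new/old policy probability ratio and A the advantage."""
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    return float(np.mean(np.minimum(ratio * advantage, clipped * advantage)))

# For a positive advantage, gains from pushing the ratio above 1 + eps
# are clipped away; for a negative advantage, the minimum keeps the
# worse (unclipped or clipped) term, so large policy steps are never
# rewarded by the surrogate.
pos = ppo_clip_objective(np.array([1.5]), np.array([1.0]))
neg = ppo_clip_objective(np.array([0.5]), np.array([-1.0]))
```

Taking the pessimistic minimum is what removes the incentive to move the new policy far from the old one in a single update.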