Imitation Learning via Differentiable Physics

Siwei Chen, Xiao Ma, Zhongwen Xu
Existing imitation learning (IL) methods such as inverse reinforcement learning (IRL) usually have a double-loop training process, alternating between learning a reward function and a policy, and tend to suffer from long training times and high variance. In this work, we identify the benefits of differentiable physics simulators and propose a new IL method, Imitation Learning via Differentiable Physics (ILD), which gets rid of the double-loop design and achieves significant improvements in final…
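The structural idea the abstract alludes to — backpropagating an imitation loss through simulator steps instead of alternating between reward learning and policy learning — can be sketched with a toy 1-D differentiable simulator. Everything below (the point-mass dynamics, the expert trajectory, the step sizes) is a hypothetical illustration, not the ILD algorithm or its benchmarks:

```python
import numpy as np

def simulate(actions, s0=0.0, dt=0.1):
    """Toy differentiable dynamics: a 1-D point with velocity control,
    s_{t+1} = s_t + dt * a_t."""
    states = [s0]
    for a in actions:
        states.append(states[-1] + dt * a)
    return np.array(states)

def loss_and_grad(actions, expert_states, s0=0.0, dt=0.1):
    """Squared tracking loss against an expert trajectory, plus its analytic
    gradient w.r.t. the whole action sequence (backprop through time):
    dL/da_t = dt * sum_{k > t} 2 * (s_k - e_k)."""
    states = simulate(actions, s0, dt)
    err = states - expert_states
    loss = float(np.sum(err ** 2))
    suffix = np.cumsum((2.0 * err)[::-1])[::-1]  # suffix[t] = sum_{k >= t} 2*err[k]
    grad = dt * suffix[1:]                       # action a_t affects states t+1..T
    return loss, grad

# hypothetical expert demonstration: move smoothly from 0 to 1
T = 20
expert = np.linspace(0.0, 1.0, T + 1)
actions = np.zeros(T)

loss_before, _ = loss_and_grad(actions, expert)
for _ in range(200):            # single loop: gradient descent directly on actions
    _, grad = loss_and_grad(actions, expert)
    actions -= 0.1 * grad
loss_after, _ = loss_and_grad(actions, expert)
```

Because the loss gradient flows through the dynamics themselves, there is no inner loop that fits a reward model — that is the structural contrast with the IRL-style double loop described above.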



Primal Wasserstein Imitation Learning
PWIL is proposed, which ties to the primal form of the Wasserstein distance between the expert and agent state-action distributions; it presents a reward function that is derived offline, as opposed to recent adversarial IL algorithms that learn a reward function through interactions with the environment, and it requires little fine-tuning.
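The primal coupling this summary refers to can be sketched as a greedy matching of agent samples to expert samples, with the matching cost turned into a reward. This is a deliberately simplified sketch — equal-weight atoms, Euclidean distance, an exponential kernel, and an assumed `len(agent_sa) <= len(expert_sa)` — not the actual PWIL implementation, which handles fractional weights:

```python
import numpy as np

def pwil_rewards(agent_sa, expert_sa, sigma=1.0):
    """Greedy primal-Wasserstein-style rewards: match each agent state-action
    to its nearest still-unmatched expert atom, then map the matching cost to
    a reward with an exponential kernel (no learned reward network needed)."""
    remaining = list(range(len(expert_sa)))
    rewards = []
    for x in agent_sa:
        dists = [np.linalg.norm(x - expert_sa[j]) for j in remaining]
        k = int(np.argmin(dists))
        rewards.append(float(np.exp(-dists[k] / sigma)))
        remaining.pop(k)  # each expert atom is consumed once
    return rewards
```

The point of the offline construction is visible here: the reward depends only on the fixed expert data and distances, with no adversarial training loop.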
Efficient Reductions for Imitation Learning
This work proposes two alternative algorithms for imitation learning where training occurs over several episodes of interaction and shows that this leads to stronger performance guarantees and improved performance on two challenging problems: training a learner to play a 3D racing game and Mario Bros.
OPIRL: Sample Efficient Off-Policy Inverse Reinforcement Learning via Distribution Matching
Off-Policy Inverse Reinforcement Learning (OPIRL) is presented, which adopts an off-policy data distribution instead of an on-policy one, enabling a reduction in the number of interactions with the environment, and learns a reward function that is transferable, with high generalization capability under changing dynamics.
Relative Entropy Inverse Reinforcement Learning
This paper proposes a model-free IRL algorithm, where the relative entropy between the empirical distribution of the state-action trajectories under a baseline policy and their distribution under the learned policy is minimized by stochastic gradient descent.
A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning
This paper proposes a new iterative algorithm, which trains a stationary deterministic policy and can be seen as a no-regret algorithm in an online learning setting, and demonstrates that this new approach outperforms previous approaches on two challenging imitation learning problems and a benchmark sequence labeling problem.
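The iterative no-regret reduction this summary describes (DAgger-style) can be sketched as a loop that rolls out the current learner, asks the expert to label the states the learner actually visits, aggregates the data, and refits. The 1-D regulation task, the threshold learner, and all constants here are hypothetical stand-ins:

```python
import numpy as np

def expert_policy(s):
    """Hypothetical expert for a 1-D regulation task: push the state toward 0."""
    return -1.0 if s > 0 else 1.0

def fit_threshold(states, actions):
    """Trivial learner: act -1 above a threshold, +1 below; fit by grid search."""
    best_t, best_acc = 0.0, -1.0
    for t in np.linspace(-2, 2, 81):
        pred = np.where(np.array(states) > t, -1.0, 1.0)
        acc = float(np.mean(pred == np.array(actions)))
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

def rollout(threshold, s0=1.5, steps=10, dt=0.3):
    """Run the current learner from s0 and record the states it visits."""
    states = [s0]
    for _ in range(steps):
        a = -1.0 if states[-1] > threshold else 1.0
        states.append(states[-1] + dt * a)
    return states

# DAgger-style loop: the dataset grows with states the *learner* visits,
# labeled by the expert, so training covers the learner's own distribution.
data_s, data_a = [], []
threshold = 2.0                      # initial learner: almost always pushes +1
for _ in range(5):
    visited = rollout(threshold)
    data_s += visited
    data_a += [expert_policy(s) for s in visited]
    threshold = fit_threshold(data_s, data_a)
```

Labeling the learner's own visited states is what distinguishes this from plain behavioral cloning, which only ever sees expert-visited states.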
Model-Augmented Actor-Critic: Backpropagating through Paths
This paper builds a policy optimization algorithm that uses the pathwise derivative of the learned model and policy across future timesteps, and matches the asymptotic performance of model-free algorithms, and scales to long horizons, a regime where typically past model-based approaches have struggled.
Generative Adversarial Imitation Learning
A new general framework for directly extracting a policy from data, as if it were obtained by reinforcement learning following inverse reinforcement learning, is proposed and a certain instantiation of this framework draws an analogy between imitation learning and generative adversarial networks.
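The adversarial analogy in this summary can be sketched with a logistic discriminator standing in for GAIL's network: train D to separate expert samples from policy samples, then use a function of D as a surrogate reward for the policy. The 1-D Gaussian "state-action features", the constants, and the `-log(1 - D)` reward shaping are illustrative assumptions, not the paper's exact setup:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_discriminator(expert_x, policy_x, steps=500, lr=0.1):
    """Logistic discriminator D(x) = sigmoid(w*x + b), trained by gradient
    ascent to output 1 on expert samples and 0 on policy samples."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        de, dp = sigmoid(w * expert_x + b), sigmoid(w * policy_x + b)
        # gradient of E_e[log D] + E_p[log(1 - D)]
        gw = np.mean((1 - de) * expert_x) - np.mean(dp * policy_x)
        gb = np.mean(1 - de) - np.mean(dp)
        w, b = w + lr * gw, b + lr * gb
    return w, b

# hypothetical data: expert features cluster near +2, policy features near -2
rng = np.random.default_rng(0)
expert_x = rng.normal(2.0, 0.5, 256)
policy_x = rng.normal(-2.0, 0.5, 256)
w, b = train_discriminator(expert_x, policy_x)

def gail_reward(x, w=w, b=b):
    """Surrogate reward: high where the discriminator finds samples expert-like."""
    return float(-np.log(1.0 - sigmoid(w * x + b) + 1e-8))
```

In the full method this reward would drive a reinforcement learning update of the policy, and the two players would alternate — which is exactly the GAN analogy the summary draws.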
Off-Policy Imitation Learning from Observations
This work proposes a sample-efficient learning-from-observations (LfO) approach which enables off-policy optimization in a principled manner; results indicate that the approach is comparable with the state of the art in terms of both sample efficiency and asymptotic performance.
Learning Robust Rewards with Adversarial Inverse Reinforcement Learning
It is demonstrated that AIRL is able to recover reward functions that are robust to changes in dynamics, enabling us to learn policies even under significant variation in the environment seen during training.
PlasticineLab: A Soft-Body Manipulation Benchmark with Differentiable Physics
A new differentiable physics benchmark called PlasticineLab is introduced, which includes a diverse collection of soft-body manipulation tasks. Experimental results suggest that RL-based approaches struggle to solve most of the tasks efficiently, while gradient-based approaches can rapidly find a solution within tens of iterations but still fall short on multi-stage tasks that require long-term planning.