Corpus ID: 241032813

Curriculum Offline Imitation Learning

Minghuan Liu, Hanye Zhao, Zhengyu Yang, Jian Shen, Weinan Zhang, Li Zhao, Tie-Yan Liu
Offline reinforcement learning (RL) tasks require the agent to learn from a pre-collected dataset without further interaction with the environment. Despite the potential to surpass the behavioral policies, RL-based methods are generally impractical due to training instability and the bootstrapping of extrapolation errors, which typically require careful hyperparameter tuning via online evaluation. In contrast, offline imitation learning (IL) has no such issues, since it learns the policy directly… 



MOPO: Model-based Offline Policy Optimization

A new model-based offline RL algorithm is proposed that applies the variance of a Lipschitz-regularized model as a penalty to the reward function; this algorithm outperforms both standard model-based RL methods and existing state-of-the-art model-free offline RL approaches on existing offline RL benchmarks, as well as on two challenging continuous control tasks.
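The core mechanism of this family of methods, penalizing rewards by model uncertainty before policy optimization, can be sketched as follows (a minimal illustration; `lam` and the ensemble-standard-deviation proxy for uncertainty are assumptions, not the paper's exact estimator):

```python
import numpy as np

# Pessimistic reward (sketch): subtract a model-uncertainty penalty
# from each predicted reward before running policy optimization.
lam = 1.0  # penalty coefficient (a tunable hyperparameter)

def penalized_reward(reward, model_std):
    # model_std stands in for the dynamics model's uncertainty estimate,
    # e.g. the standard deviation across an ensemble of learned models.
    return reward - lam * model_std

rewards = np.array([1.0, 1.0, 1.0])
stds = np.array([0.0, 0.5, 2.0])  # low to high model uncertainty
pessimistic = penalized_reward(rewards, stds)
```

High-uncertainty transitions thus receive lower effective reward, discouraging the learned policy from exploiting regions the model cannot predict reliably.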

On Value Discrepancy of Imitation Learning

A framework for analyzing the theoretical properties of imitation learning approaches, based on discrepancy propagation analysis, implies that GAIL has less compounding error than behavioral cloning; this is verified empirically in this paper, indicating that the proposed framework is a general tool for analyzing imitation learning approaches.

NeoRL: A Near Real-World Benchmark for Offline Reinforcement Learning

A near-real-world offline RL benchmark, named NeoRL, is presented; it contains datasets from various domains with controlled sizes, plus extra test datasets for policy validation. It is argued that the performance of a policy should also be compared with the deterministic version of the behavior policy, rather than with the dataset reward.

MOReL : Model-Based Offline Reinforcement Learning

Theoretically, it is shown that MOReL is minimax optimal (up to log factors) for offline RL, and through experiments, it matches or exceeds state-of-the-art results in widely studied offline RL benchmarks.

Efficient Reductions for Imitation Learning

This work proposes two alternative algorithms for imitation learning in which training occurs over several episodes of interaction, and shows that this leads to stronger performance guarantees and improved performance on two challenging problems: training a learner to play a 3D racing game and Mario Bros.

BAIL: Best-Action Imitation Learning for Batch Deep Reinforcement Learning

This work proposes a new algorithm, Best-Action Imitation Learning (BAIL), which learns a V function, uses the V function to select actions it believes to be high-performing, and then uses those actions to train a policy network using imitation learning.
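The selection step described above can be sketched as follows (a toy illustration, not the paper's implementation: BAIL fits an "upper envelope" V(s) by regularized regression, whereas a constant quantile stands in here, and the threshold rule G > x · V(s) is simplified accordingly):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy batch: 1-D states, scalar actions, Monte Carlo returns.
states = rng.normal(size=(100, 1))
actions = rng.normal(size=(100, 1))
returns = rng.normal(size=100)

# Placeholder value estimate standing in for the learned upper envelope.
def v_upper(s):
    return np.full(len(s), np.quantile(returns, 0.8))

# Selection: keep (s, a) pairs whose return nearly reaches the envelope.
x = 0.9  # selection threshold in the G > x * V(s) rule
keep = returns > x * v_upper(states)
sel_states, sel_actions = states[keep], actions[keep]
# A policy network would then be fit to (sel_states, sel_actions)
# by plain behavioral cloning.
```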

A Divergence Minimization Perspective on Imitation Learning Methods

A unified probabilistic perspective on IL algorithms based on divergence minimization is presented, conclusively identifying that IRL's state-marginal matching objective contributes most to its superior performance; the new understanding of IL methods is then applied to the problem of state-marginal matching.

Generative Adversarial Imitation Learning

A new general framework for directly extracting a policy from data, as if it were obtained by reinforcement learning following inverse reinforcement learning, is proposed, and a certain instantiation of this framework draws an analogy between imitation learning and generative adversarial networks.
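The GAN analogy is made precise by the paper's saddle-point objective, in which a discriminator D tries to tell policy occupancy apart from expert occupancy (pi_E denotes the expert policy, H a causal-entropy regularizer with weight lambda):

    min_pi  max_D   E_pi[log D(s, a)] + E_{pi_E}[log(1 - D(s, a))] - lambda * H(pi)

The policy plays the role of the generator, and the discriminator's output serves as a learned reward signal for the policy update.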

Exponentially Weighted Imitation Learning for Batched Historical Data

A monotonic advantage-reweighted imitation learning strategy is proposed that is applicable to problems with complex nonlinear function approximation, works well with hybrid (discrete and continuous) action spaces, and can be used to learn from data generated by an unknown policy.
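The reweighting idea can be sketched as follows (a minimal illustration under assumed names: the baseline, clipping bound, and temperature `beta` are placeholders, and the advantage here is a crude return-minus-mean estimate rather than a learned critic):

```python
import numpy as np

# Hypothetical batch: per-trajectory returns and a baseline give advantages.
returns = np.array([1.0, 0.2, -0.5, 2.0])
baseline = returns.mean()
advantages = returns - baseline

beta = 1.0  # temperature: larger beta imitates high-advantage actions harder
weights = np.exp(beta * np.clip(advantages, None, 2.0))  # clip for stability

# Each (s, a) pair's log-likelihood in the cloning loss is scaled by its
# weight, i.e. loss = -sum_i weights[i] * log pi(a_i | s_i), so the learner
# imitates above-average actions more strongly than below-average ones.
```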

A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning

This paper proposes a new iterative algorithm that trains a stationary deterministic policy and can be seen as a no-regret algorithm in an online learning setting; the new approach outperforms previous approaches on two challenging imitation learning problems and a benchmark sequence labeling problem.
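The iterative scheme, rolling out the current learner, having the expert label the visited states, and retraining on the aggregate dataset, can be sketched as a toy loop (everything here is a stand-in: the expert, the learner, and the state distribution are illustrative placeholders, and a real rollout would follow the learner's own actions):

```python
import numpy as np

rng = np.random.default_rng(1)

def expert(s):
    # Stand-in expert labeler: a fixed threshold rule on 1-D states.
    return (s > 0).astype(float)

def fit_policy(S, A):
    # Stand-in learner: refit a threshold from the labeled data.
    thresh = S[A == 1].mean() if (A == 1).any() else 0.0
    return lambda s: (s > thresh).astype(float)

S = rng.normal(size=50)   # states initially visited under the expert
A = expert(S)
policy = fit_policy(S, A)

for _ in range(3):
    visited = rng.normal(size=50)             # states the learner visits
    S = np.concatenate([S, visited])          # aggregate the dataset
    A = np.concatenate([A, expert(visited)])  # expert labels new states
    policy = fit_policy(S, A)                 # retrain on the aggregate
```

Because every round's data is kept, the learner is trained on the distribution of states it actually encounters, which is what yields the no-regret guarantee.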