Corpus ID: 3307812

Reinforcement Learning from Imperfect Demonstrations

@article{gao_imperfect_demonstrations,
  title={Reinforcement Learning from Imperfect Demonstrations},
  author={Yang Gao and Huazhe Xu and Ji Lin and Fisher Yu and Sergey Levine and Trevor Darrell}
}
Robust real-world learning should benefit from both demonstrations and interactions with the environment. Current approaches to learning from demonstration and reward perform supervised learning on expert demonstration data and use reinforcement learning to further improve performance based on the reward received from the environment. These tasks have divergent losses that are difficult to optimize jointly, and such methods can be very sensitive to noisy demonstrations. We propose a unified…
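The joint objective the abstract describes, a TD loss on environment transitions combined with a supervised loss on expert state-action pairs, can be sketched in toy tabular form. This is an invented illustration of the general pattern, not the paper's proposed method; every name and constant here is made up.

```python
# Toy tabular sketch (invented for illustration, not the paper's method):
# an agent that mixes a TD update from environment transitions with a
# large-margin supervised update from demonstration (state, action) pairs.

def td_update(Q, s, a, r, s2, actions, alpha=0.5, gamma=0.9):
    # Standard Q-learning step on an environment transition.
    target = r + gamma * max(Q[(s2, b)] for b in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

def margin_update(Q, s, a_demo, actions, margin=1.0, alpha=0.5):
    # Supervised step: push the demonstrated action's value above every
    # other action's value by at least `margin`.
    best_other = max(Q[(s, b)] + margin for b in actions if b != a_demo)
    if Q[(s, a_demo)] < best_other:
        Q[(s, a_demo)] += alpha * (best_other - Q[(s, a_demo)])

actions = [0, 1]
Q = {(s, a): 0.0 for s in range(3) for a in actions}
td_update(Q, s=0, a=0, r=1.0, s2=1, actions=actions)   # environment data
margin_update(Q, s=0, a_demo=1, actions=actions)       # demonstration data
print(Q[(0, 0)], Q[(0, 1)])
```

Interleaving the two updates is exactly the kind of joint optimization the abstract calls fragile: a noisy demonstration drives the margin step toward a bad action regardless of what the TD step has learned.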

Citations

Reinforcement Learning with Supervision from Noisy Demonstrations
Experimental results in various environments with multiple popular reinforcement learning algorithms show that the proposed approach can learn robustly with noisy demonstrations, and achieve higher performance in fewer iterations.
Demonstration actor critic
Extrapolating Beyond Suboptimal Demonstrations via Inverse Reinforcement Learning from Observations
A novel reward-learning-from-observation algorithm, Trajectory-ranked Reward EXtrapolation (T-REX), that extrapolates beyond a set of (approximately) ranked demonstrations in order to infer high-quality reward functions from a set of potentially poor demonstrations.
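The core idea here, learning a reward that agrees with a ranking over trajectories, can be sketched with a Bradley-Terry-style pairwise loss. The linear reward model, its weights, and the trajectory features below are all hypothetical stand-ins.

```python
import math

# Hand-rolled toy of the pairwise ranking loss behind reward extrapolation:
# given two trajectories with a known preference, fit a reward so that the
# preferred trajectory gets the higher predicted return. The linear reward,
# its weights, and the features are all invented for this sketch.

def traj_return(weights, traj_features):
    # Total predicted reward over one trajectory (a list of feature vectors).
    return sum(sum(w * f for w, f in zip(weights, feats))
               for feats in traj_features)

def ranking_loss(weights, worse, better):
    # Bradley-Terry style negative log-likelihood of the correct ranking.
    r_w = traj_return(weights, worse)
    r_b = traj_return(weights, better)
    return -r_b + math.log(math.exp(r_w) + math.exp(r_b))

weights = [0.5, -0.2]
worse = [[0.0, 1.0], [0.1, 1.0]]
better = [[1.0, 0.0], [0.9, 0.0]]
loss = ranking_loss(weights, worse, better)
print(round(loss, 3))
```

Minimizing this loss over many ranked pairs yields a reward model that can score trajectories better than any in the demonstration set, which is what makes extrapolation beyond suboptimal demonstrations possible.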
Anomaly Guided Policy Learning from Imperfect Demonstrations
This work bridges the exploration and learning-from-imperfect-demonstration (LfID) problems from the perspective of anomaly detection, proposes the AGPO method to address them, and shows the superiority of AGPO in this scenario.
Reward Relabelling for combined Reinforcement and Imitation Learning on sparse-reward tasks
This work presents a new method that leverages demonstrations and episodes collected online in any sparse-reward environment with any off-policy algorithm. It is based on a reward bonus given to demonstrations and successful episodes, encouraging expert imitation and self-imitation.
Self-Imitation Learning from Demonstrations
Self-Imitation Learning (SIL), a recent RL algorithm that exploits an agent's past good experience, is extended to the LfD setup by initializing its replay buffer with demonstrations, and is shown to outperform existing LfD algorithms in settings with suboptimal demonstrations and sparse rewards.
Interactive Reinforcement Learning from Demonstration and Human Evaluative Feedback
This paper proposes a model-based method, IRL-TAMER, that combines learning from demonstration via inverse reinforcement learning (IRL) with learning from human reward via the TAMER framework. The results suggest that although an agent learning via IRL can learn a useful value function indicating which states are good based on the demonstration, it cannot obtain an effective policy for navigating to the goal state from one demonstration alone.
SQIL: Imitation Learning via Reinforcement Learning with Sparse Rewards
This work proposes soft Q imitation learning (SQIL), a simple alternative that still uses RL but does not require learning a reward function, and can be implemented with a handful of minor modifications to any standard Q-learning or off-policy actor-critic algorithm.
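SQIL's central trick is reward relabelling: demonstration transitions get a constant reward of 1 and the agent's own transitions get 0, after which ordinary value-based RL runs on the mixed buffer. The miniature below illustrates that relabelling with an invented buffer layout and a plain tabular Q-learner (SQIL itself uses soft Q-learning).

```python
# Miniature sketch of SQIL's reward relabelling: demonstration transitions
# are stored with constant reward 1, the agent's own transitions with
# reward 0, and plain Q-learning then runs on the mixed buffer. The buffer
# layout and tabular learner are invented; SQIL itself uses soft Q-learning.

def relabel(demo_transitions, agent_transitions):
    buffer = [(s, a, 1.0, s2) for (s, a, s2) in demo_transitions]
    buffer += [(s, a, 0.0, s2) for (s, a, s2) in agent_transitions]
    return buffer

def q_learning(buffer, actions, sweeps=50, alpha=0.1, gamma=0.9):
    Q = {}
    for _ in range(sweeps):
        for s, a, r, s2 in buffer:
            best_next = max(Q.get((s2, b), 0.0) for b in actions)
            td = r + gamma * best_next - Q.get((s, a), 0.0)
            Q[(s, a)] = Q.get((s, a), 0.0) + alpha * td
    return Q

demos = [(0, 1, 1)]   # expert took action 1 in state 0, reaching state 1
agent = [(0, 0, 2)]   # the agent tried action 0 instead
Q = q_learning(relabel(demos, agent), actions=[0, 1])
print(Q[(0, 1)] > Q[(0, 0)])  # the demonstrated action scores higher
```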
Shaping Rewards for Reinforcement Learning with Imperfect Demonstrations using Generative Models
This work proposes a method that combines reinforcement and imitation learning by shaping the reward function with a state-and-action-dependent potential that is trained from demonstration data, using a generative model.
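This line of work builds on potential-based shaping, where the environment reward is augmented with gamma * phi(s2) - phi(s), a form known to leave the optimal policy unchanged. The sketch below is a simplified state-only version (the paper's potential is state-and-action-dependent and trained from demonstrations; here phi is a hand-made stand-in):

```python
# Simplified state-only sketch of potential-based shaping: the environment
# reward is augmented with gamma * phi(s2) - phi(s), a form known to leave
# the optimal policy unchanged. Here phi is a hand-made stand-in (distance
# to a goal) rather than a potential trained from demonstration data.

GOAL = 5

def phi(state):
    # Higher potential when closer to the goal state.
    return -abs(GOAL - state)

def shaped_reward(r, s, s2, gamma=0.99):
    return r + gamma * phi(s2) - phi(s)

# Even with zero environment reward, moving toward the goal earns a
# positive bonus and moving away earns a negative one.
print(shaped_reward(0.0, s=2, s2=3) > 0)
print(shaped_reward(0.0, s=3, s2=2) < 0)
```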
PsiPhi-Learning: Reinforcement Learning with Demonstrations using Successor Features and Inverse Temporal Difference Learning
A multi-task inverse reinforcement learning (IRL) algorithm is proposed, called inverse temporal difference learning (ITD), that learns shared state features, alongside per-agent successor features and preference vectors, purely from demonstrations without reward labels.

References

Learning from Demonstrations for Real World Reinforcement Learning
This paper presents an algorithm, Deep Q-learning from Demonstrations (DQfD), that leverages this data to massively accelerate the learning process even from relatively small amounts of demonstration data, and is able to automatically assess the necessary ratio of demonstration data while learning, thanks to a prioritized replay mechanism.
Deep Q-learning From Demonstrations
This paper presents an algorithm, Deep Q-learning from Demonstrations (DQfD), that leverages even relatively small amounts of demonstration data to massively accelerate the learning process, and is able to automatically assess the necessary ratio of demonstration data while learning, thanks to a prioritized replay mechanism.
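The prioritized-replay idea both DQfD summaries mention, keeping demonstration transitions in rotation by boosting their sampling priority, might look like this in miniature. The bonus constant and buffer layout are invented for the sketch.

```python
import random

# Miniature sketch of prioritized replay with a demonstration bonus:
# transitions are sampled in proportion to their TD-error priority, and
# demonstration transitions receive a constant boost so they stay in
# rotation. The bonus value and buffer layout are invented for this toy.

DEMO_BONUS = 1.0

def priorities(buffer):
    return [abs(td) + (DEMO_BONUS if is_demo else 0.0) + 1e-3
            for td, is_demo in buffer]

def sample(buffer, k, rng):
    weights = priorities(buffer)
    return rng.choices(range(len(buffer)), weights=weights, k=k)

rng = random.Random(0)
buffer = [(0.1, True), (0.1, False), (0.1, False)]  # (td_error, is_demo)
picks = sample(buffer, 1000, rng)
demo_frac = picks.count(0) / len(picks)
print(demo_frac > 0.5)  # demo transition dominates despite equal TD error
```

Letting TD error drive the priorities while the bonus keeps demonstrations visible is one way a learner could, in effect, adapt the demonstration-to-experience ratio on its own.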
Reinforcement Learning with Unsupervised Auxiliary Tasks
This paper significantly outperforms the previous state of the art on Atari, averaging 880% expert human performance, and on a challenging suite of first-person, three-dimensional Labyrinth tasks, yielding a mean speedup in learning of 10× and averaging 87% expert human performance on Labyrinth.
Integrating reinforcement learning with human demonstrations of varying ability
This work introduces Human-Agent Transfer (HAT), an algorithm that combines transfer learning, learning from demonstration and reinforcement learning to achieve rapid learning and high performance in…
Exploration from Demonstration for Interactive Reinforcement Learning
This work presents a model-free policy-based approach called Exploration from Demonstration (EfD) that uses human demonstrations to guide search space exploration and shows how EfD scales to large problems and provides convergence speed-ups over traditional exploration and interactive learning methods.
Robust Imitation of Diverse Behaviors
A new version of GAIL is developed that is much more robust than the purely supervised controller, especially with few demonstrations, and avoids mode collapse, capturing many diverse behaviors when GAIL on its own does not.
Learning from Limited Demonstrations
This work proves an upper bound on the Bellman error of the estimate computed by APID at each iteration, and shows empirically that APID outperforms pure Approximate Policy Iteration, a state-of-the-art LfD algorithm, and supervised learning in a variety of scenarios, including when very few and/or suboptimal demonstrations are available.
Boosted Bellman Residual Minimization Handling Expert Demonstrations
This paper addresses the problem of batch Reinforcement Learning with Expert Demonstrations (RLED) by proposing algorithms that leverage expert data to find an optimal policy of a Markov Decision Process (MDP), using a data set of fixed sampled transitions of the MDP as well as a data set of fixed expert demonstrations.
Loss is its own Reward: Self-Supervision for Reinforcement Learning
This work considers a range of self-supervised tasks that incorporate states, actions, and successors to provide auxiliary losses that offer ubiquitous and instantaneous supervision for representation learning even in the absence of reward.
Generative Adversarial Imitation Learning
A new general framework for directly extracting a policy from data, as if it were obtained by reinforcement learning following inverse reinforcement learning, is proposed and a certain instantiation of this framework draws an analogy between imitation learning and generative adversarial networks.
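The adversarial recipe GAIL instantiates, a discriminator separating expert from policy state-action pairs whose output becomes a learned reward signal, can be sketched in a toy form. The features, weights, and the particular reward convention below are illustrative choices, not the paper's exact formulation.

```python
import math

# Toy version of the adversarial imitation signal: a logistic discriminator
# scores how expert-like a (state, action) pair looks, and that score is
# turned into a reward for the policy. The features, weights, and the
# particular reward convention below are illustrative choices only.

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def discriminator(w, b, feats):
    # Probability the pair came from the expert, under this toy convention.
    return sigmoid(sum(wi * f for wi, f in zip(w, feats)) + b)

def imitation_reward(w, b, feats):
    # Larger when the discriminator thinks the pair looks expert-like.
    return -math.log(1.0 - discriminator(w, b, feats))

w, b = [2.0, -1.0], 0.0
expert_like = [1.0, 0.0]
policy_like = [0.0, 1.0]
print(imitation_reward(w, b, expert_like) > imitation_reward(w, b, policy_like))
```

In the full method the discriminator and policy are trained against each other, so the reward signal sharpens as the policy's behavior approaches the expert's.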