Apprenticeship learning via inverse reinforcement learning

@inproceedings{Abbeel2004ApprenticeshipLV,
  title={Apprenticeship learning via inverse reinforcement learning},
  author={P. Abbeel and A. Ng},
  booktitle={Proceedings of the Twenty-First International Conference on Machine Learning},
  year={2004}
}
  • P. Abbeel, A. Ng
  • Published 4 July 2004
  • Computer Science
  • Proceedings of the twenty-first international conference on Machine learning
We consider learning in a Markov decision process where we are not explicitly given a reward function, but where instead we can observe an expert demonstrating the task that we want to learn to perform. This setting is useful in applications (such as the task of driving) where it may be difficult to write down an explicit reward function specifying exactly how different desiderata should be traded off. We think of the expert as trying to maximize a reward function that is expressible as a… 
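The abstract is cut off before the algorithmic details, but the core idea of the paper is to match the expert's discounted feature expectations. The sketch below (Python with NumPy; the feature map phi, the demonstration trajectories, and the RL solver that would sit in the outer loop are assumptions, not taken from this page) illustrates the projection variant of the algorithm:

import numpy as np

def feature_expectations(trajectories, phi, gamma=0.99):
    # Monte-Carlo estimate of mu(pi) = E[ sum_t gamma^t * phi(s_t) ]
    mus = [sum(gamma**t * phi(s) for t, s in enumerate(traj)) for traj in trajectories]
    return np.mean(mus, axis=0)

def projection_step(mu_E, mu_bar_prev, mu_new):
    # One update of the projection variant: project mu_E onto the line through
    # mu_bar_prev and mu_new, then point the reward weights at the residual.
    # The clip keeps mu_bar on the segment; the paper uses the plain orthogonal projection.
    d = mu_new - mu_bar_prev
    lam = np.clip((d @ (mu_E - mu_bar_prev)) / (d @ d), 0.0, 1.0)
    mu_bar = mu_bar_prev + lam * d
    w = mu_E - mu_bar            # reward weights R(s) = w . phi(s) for the next RL step
    t = np.linalg.norm(w)        # margin; terminate when t <= epsilon
    return mu_bar, w, t

In the full algorithm, each projection step is followed by solving the MDP with reward w . phi to obtain a new policy, whose feature expectations feed the next step.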
Stochastic convex optimization for provably efficient apprenticeship learning
TLDR
A computationally efficient algorithm is developed and high confidence regret bounds are derived on the quality of the extracted policy, utilizing results from stochastic convex optimization and recent works in approximate linear programming for solving forward MDPs.
Exploration and apprenticeship learning in reinforcement learning
TLDR
This paper considers the apprenticeship learning setting in which a teacher demonstration of the task is available, and shows that, given the initial demonstration, no explicit exploration is necessary, and the student can attain near-optimal performance simply by repeatedly executing "exploitation policies" that try to maximize rewards.
Apprenticeship learning via soft local homomorphisms
TLDR
This paper proposes to use a transfer method, known as soft homomorphism, in order to generalize the expert's policy to unvisited regions of the state space, which can be used either as the robot's final policy or to calculate the feature frequencies within an IRL algorithm.
Bootstrapping Apprenticeship Learning
TLDR
The quality of the learned policies is highly sensitive to the error in estimating the feature counts; a novel approach is introduced for bootstrapping the demonstration by assuming that the expert is (near-)optimal and that the dynamics of the system are known.
Apprenticeship learning with few examples
TLDR
The quality of the learned policies is sensitive to the error in estimating the averages of the features when the dynamics of the system are stochastic, and two new approaches for bootstrapping the demonstrations are introduced.
Compatible Reward Inverse Reinforcement Learning
TLDR
A novel model-free IRL approach that, unlike most existing IRL algorithms, does not require specifying a function space in which to search for the expert's reward function.
Inverse Reinforcement Learning via Matching of Optimality Profiles
TLDR
This work proposes an algorithm that learns a reward function from demonstrations together with a weak supervision signal in the form of a distribution over rewards collected during the demonstrations, and shows that the method is capable of learning reward functions such that policies trained to optimize them outperform the demonstrations used for fitting the reward functions.
Inverse Reinforcement Learning from a Gradient-based Learner
TLDR
This paper proposes a new algorithm for Inverse Reinforcement Learning, in which the goal is to recover the reward function being optimized by an agent, given a sequence of policies produced during learning.
Inverse Reinforcement Learning with Multiple Ranked Experts
TLDR
This work considers the problem of learning to behave optimally in a Markov Decision Process when a reward function is not specified but a set of demonstrators of varying performance is available, and uses ideas from ordinal regression to find a reward function that maximizes the margin between the different ranks.
Relative Entropy Inverse Reinforcement Learning
TLDR
This paper proposes a model-free IRL algorithm, where the relative entropy between the empirical distribution of the state-action trajectories under a baseline policy and their distribution under the learned policy is minimized by stochastic gradient descent.
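As a rough illustration of the update described above, the following sketch assumes the simplified case where the baseline and sampling policies coincide, so the importance weights reduce to exp(theta . phi(tau)); the inputs mu_E and sampled_feats are hypothetical arrays, not part of the cited paper's code:

import numpy as np

def reirl_sgd_step(theta, mu_E, sampled_feats, lr=0.01):
    # sampled_feats[i]: discounted feature counts of trajectory i drawn from
    # the sampling policy; mu_E: empirical expert feature counts.
    w = np.exp(sampled_feats @ theta)      # importance weights
    w = w / w.sum()
    grad = mu_E - w @ sampled_feats        # stochastic gradient of the dual objective
    return theta + lr * grad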

References

SHOWING 1-10 OF 20 REFERENCES
Robot Learning From Demonstration
TLDR
This work has shown that incorporating a task level direct learning component, which is non-model-based, in addition to the model-based planner, is useful in compensating for structural modeling errors and slow model learning.
Policy Invariance Under Reward Transformations: Theory and Application to Reward Shaping
TLDR
Conditions under which modifications to the reward function of a Markov decision process preserve the optimal policy are investigated to shed light on the practice of reward shaping, a method used in reinforcement learning whereby additional training rewards are used to guide the learning agent.
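The paper's central result is that shaping terms of the potential-based form F(s, a, s') = gamma * Phi(s') - Phi(s) leave the optimal policy unchanged. A minimal sketch (the reward R and potential Phi below are hypothetical placeholders):

def potential_shaped_reward(R, Phi, gamma):
    # Wrap a reward function with a potential-based shaping term; this is
    # the form the paper shows preserves the optimal policy.
    def R_shaped(s, a, s_next):
        return R(s, a, s_next) + gamma * Phi(s_next) - Phi(s)
    return R_shaped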
Formation and control of optimal trajectory in human multijoint arm movement
TLDR
The idea that the human hand trajectory is planned and controlled in accordance with the minimum torque-change criterion is supported by developing an iterative scheme, with which the optimal trajectory and the associated motor command are simultaneously computed.
Learning movement sequences from demonstration
  • R. Amit, M. Matarić
  • Computer Science
    Proceedings 2nd International Conference on Development and Learning. ICDL 2002
  • 2002
TLDR
Presents a control and learning architecture for humanoid robots designed for acquiring movement skills in the context of imitation learning, and uses the notion of visuo-motor primitives, modules capable of recognizing as well as executing similar movements.
Linear Programming and Sequential Decisions
Using an illustration drawn from the area of inventory control, this paper demonstrates how a typical sequential probabilistic model may be formulated in terms of (a) an initial decision rule and (b) a…
Algorithms for Inverse Reinforcement Learning
TLDR
This paper addresses inverse reinforcement learning, the problem of extracting a reward function given observed optimal behaviour, characterizes the set of reward functions for which a given policy is optimal, and proposes algorithms for the finite-state, large-state-space, and sampled-trajectory settings.
An organizing principle for a class of voluntary movements
  • N. Hogan
  • Mathematics, Medicine
    The Journal of neuroscience : the official journal of the Society for Neuroscience
  • 1984
This paper presents a mathematical model which predicts both the major qualitative features and, within experimental error, the quantitative details of a class of perturbed and unperturbed voluntary movements.
Learning by watching: extracting reusable task knowledge from visual observation of human performance
TLDR
A novel task instruction method for future intelligent robots is presented that learns reusable task plans by watching a human perform assembly tasks, resulting in a hierarchical task plan describing the higher-level structure of the task.
Statistical learning theory
TLDR
Presenting a method for determining the necessary and sufficient conditions for consistency of the learning process, the author covers function estimation from small data pools, applying these estimates to real-life problems, and much more.
ALVINN: An Autonomous Land Vehicle in a Neural Network
TLDR
ALVINN (Autonomous Land Vehicle In a Neural Network) is a 3-layer back-propagation network designed for the task of road following that can effectively follow real roads under certain field conditions.