Apprenticeship learning via inverse reinforcement learning

  @inproceedings{Abbeel2004Apprenticeship,
    title={Apprenticeship learning via inverse reinforcement learning},
    author={P. Abbeel and A. Ng},
    booktitle={Proceedings of the twenty-first international conference on Machine learning},
    year={2004}
  }
  • Published 4 July 2004
  • Computer Science
We consider learning in a Markov decision process where we are not explicitly given a reward function, but where instead we can observe an expert demonstrating the task that we want to learn to perform. This setting is useful in applications (such as the task of driving) where it may be difficult to write down an explicit reward function specifying exactly how different desiderata should be traded off. We think of the expert as trying to maximize a reward function that is expressible as a… 
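Because the reward is assumed linear in known features, an expert's behavior can be summarized by its discounted feature expectations, which the demonstrations let us estimate empirically. A minimal sketch of that estimate (the feature map `phi` and the demonstrations here are hypothetical, not from the paper):

```python
import numpy as np

def feature_expectations(trajectories, phi, gamma=0.9):
    """Empirical discounted feature expectations:
    mu = (1/m) * sum_i sum_t gamma^t * phi(s_t^(i)) over m trajectories."""
    mu = None
    for states in trajectories:
        for t, s in enumerate(states):
            f = (gamma ** t) * phi(s)
            mu = f if mu is None else mu + f
    return mu / len(trajectories)

# Hypothetical 2-feature map over integer states.
phi = lambda s: np.array([float(s == 0), float(s == 1)])
demos = [[0, 1, 1], [0, 0, 1]]  # two expert demonstrations
mu_E = feature_expectations(demos, phi, gamma=0.9)
```

Matching these expectations is the core of the algorithm: any policy whose feature expectations are close to `mu_E` achieves value close to the expert's under every reward in the assumed linear class.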

Stochastic convex optimization for provably efficient apprenticeship learning

A computationally efficient algorithm is developed and high-confidence regret bounds on the quality of the extracted policy are derived, utilizing results from stochastic convex optimization and recent work in approximate linear programming for solving forward MDPs.

Exploration and apprenticeship learning in reinforcement learning

This paper considers the apprenticeship learning setting in which a teacher demonstration of the task is available, and shows that, given the initial demonstration, no explicit exploration is necessary, and the student can attain near-optimal performance simply by repeatedly executing "exploitation policies" that try to maximize rewards.

Apprenticeship learning via soft local homomorphisms

This paper proposes to use a transfer method, known as soft homomorphism, in order to generalize the expert's policy to unvisited regions of the state space, which can be used either as the robot's final policy or to calculate the feature frequencies within an IRL algorithm.

Bootstrapping Apprenticeship Learning

The quality of the learned policies is highly sensitive to the error in estimating the feature counts, and a novel approach is introduced for bootstrapping the demonstration by assuming that the expert is (near-)optimal and that the dynamics of the system are known.

Balancing Sample Efficiency and Suboptimality in Inverse Reinforcement Learning

A new model-free IRL method that autonomously negotiates a trade-off between the error induced on the learned policy when a potentially sub-optimal reward is chosen and the estimation error caused by using samples in the forward learning phase; this trade-off can be controlled by explicitly optimizing the discount factor of the related learning problem.

Apprenticeship learning with few examples

Identifiability and generalizability from multiple experts in Inverse Reinforcement Learning

This work provides conditions characterizing when data from multiple experts in a given environment makes it possible to generalize and train an optimal agent in a new environment, and characterizes reward identifiability when the reward function can be represented as a linear combination of given features (making it more interpretable) or when approximate transition matrices are available.

Apprenticeship Learning via Frank-Wolfe

This work shows that a variation of the Frank-Wolfe (FW) method that is based on taking “away steps” achieves a linear rate of convergence when applied to AL and that a stochastic version of the FW algorithm can be used to avoid precise estimation of feature expectations.
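The away-step variant can remove weight from a previously selected atom rather than only adding new ones, which is what yields the linear rate on problems like AL's projection onto a convex hull of feature expectations. A self-contained sketch on a generic quadratic over a convex hull (the vertices and target stand in for policy and expert feature expectations; all names are illustrative, not the paper's code):

```python
import numpy as np

def away_step_fw(vertices, target, iters=200):
    """Frank-Wolfe with away steps: min_x 0.5*||x - target||^2 over
    conv(vertices). x is tracked as convex weights alpha over the atoms."""
    V = np.asarray(vertices, dtype=float)        # rows are atoms
    target = np.asarray(target, dtype=float)
    alpha = np.zeros(len(V)); alpha[0] = 1.0     # start at the first vertex
    for _ in range(iters):
        x = alpha @ V
        g = x - target                           # gradient of the quadratic
        s = int(np.argmin(V @ g))                # classic FW atom
        active = np.where(alpha > 1e-12)[0]
        a = int(active[np.argmax(V[active] @ g)])  # worst active ("away") atom
        d_fw, d_aw = V[s] - x, x - V[a]
        if -g @ d_fw >= -g @ d_aw:               # FW step
            gamma = min(max((-g @ d_fw) / max(d_fw @ d_fw, 1e-16), 0.0), 1.0)
            alpha *= (1.0 - gamma)
            alpha[s] += gamma
        else:                                     # away step: shrink alpha[a]
            gmax = alpha[a] / max(1.0 - alpha[a], 1e-16)
            gamma = min(max((-g @ d_aw) / max(d_aw @ d_aw, 1e-16), 0.0), gmax)
            alpha *= (1.0 + gamma)
            alpha[a] -= gamma
    return alpha @ V

# Project a point inside the unit square onto the hull of its corners.
x_star = away_step_fw([[0, 0], [1, 0], [0, 1], [1, 1]], [0.25, 0.75])
```

The exact line search used here is available because the objective is quadratic; the stochastic version mentioned in the summary would replace the exact gradient with a sampled estimate of the feature expectations.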

Compatible Reward Inverse Reinforcement Learning

A novel model-free IRL approach that, unlike most existing IRL algorithms, does not require specifying a function space in which to search for the expert's reward function.

Inverse Reinforcement Learning via Matching of Optimality Profiles

This work proposes an algorithm that learns a reward function from demonstrations together with a weak supervision signal in the form of a distribution over rewards collected during the demonstrations, and shows that the method is capable of learning reward functions such that policies trained to optimize them outperform the demonstrations used for fitting the reward functions.



Robot Learning From Demonstration

This work has shown that incorporating a task level direct learning component, which is non-model-based, in addition to the model-based planner, is useful in compensating for structural modeling errors and slow model learning.

Policy Invariance Under Reward Transformations: Theory and Application to Reward Shaping

Conditions under which modifications to the reward function of a Markov decision process preserve the optimal policy are investigated, to shed light on the practice of reward shaping, a method used in reinforcement learning whereby additional training rewards are used to guide the learning agent.
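The invariance result can be checked numerically: adding a potential-based term F(s, a, s') = γΦ(s') − Φ(s) for an arbitrary potential Φ leaves the greedy policy of an MDP unchanged. A toy sketch (the chain MDP and the potential are made up for illustration):

```python
import numpy as np

# Hypothetical 4-state chain: action 0 = left, 1 = right; reward 1 for
# being in the last state, 0 otherwise.
n, gamma = 4, 0.9
step = lambda s, a: min(s + 1, n - 1) if a == 1 else max(s - 1, 0)
R = lambda s, a, s2: 1.0 if s2 == n - 1 else 0.0
Phi = np.array([0.0, 5.0, -3.0, 2.0])           # arbitrary potential

def greedy_policy(reward):
    V = np.zeros(n)
    for _ in range(500):                        # value iteration
        V = np.array([max(reward(s, a, step(s, a)) + gamma * V[step(s, a)]
                          for a in (0, 1)) for s in range(n)])
    return [int(np.argmax([reward(s, a, step(s, a)) + gamma * V[step(s, a)]
                           for a in (0, 1)])) for s in range(n)]

# Shaped reward: R + F with F(s, a, s') = gamma * Phi(s') - Phi(s).
shaped = lambda s, a, s2: R(s, a, s2) + gamma * Phi[s2] - Phi[s]
pi_plain, pi_shaped = greedy_policy(R), greedy_policy(shaped)
```

The shaped Q-function differs from the original only by Φ(s), which is constant over actions at each state, so the argmax (and hence the greedy policy) is identical.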

Formation and control of optimal trajectory in human multijoint arm movement

The idea that the human hand trajectory is planned and controlled in accordance with the minimum torque-change criterion is supported by developing an iterative scheme with which the optimal trajectory and the associated motor command are simultaneously computed.

Learning movement sequences from demonstration

  • R. Amit, M. Matarić
  • Computer Science, Biology
    Proceedings 2nd International Conference on Development and Learning. ICDL 2002
  • 2002
Presents a control and learning architecture for humanoid robots designed for acquiring movement skills in the context of imitation learning, and uses the notion of visuo-motor primitives, modules capable of recognizing as well as executing similar movements.

Linear Programming and Sequential Decisions

Using an illustration drawn from the area of inventory control, this paper demonstrates how a typical sequential probabilistic model may be formulated in terms of (a) an initial decision rule and (b) a…

An organizing principle for a class of voluntary movements

  • N. Hogan
  • Biology
    The Journal of neuroscience : the official journal of the Society for Neuroscience
  • 1984
This paper presents a mathematical model which predicts both the major qualitative features and, within experimental error, the quantitative details of a class of perturbed and unperturbed…

Learning by watching: extracting reusable task knowledge from visual observation of human performance

A novel task instruction method for future intelligent robots that learns reusable task plans by watching a human perform assembly tasks is presented, which results in a hierarchical task plan describing the higher level structure of the task.

ALVINN: An Autonomous Land Vehicle in a Neural Network

ALVINN (Autonomous Land Vehicle In a Neural Network) is a 3-layer back-propagation network designed for the task of road following that can effectively follow real roads under certain field conditions.

The Nature of Statistical Learning Theory

  • V. Vapnik
  • Computer Science
    Statistics for Engineering and Information Science
  • 2000
Setting of the learning problem; consistency of learning processes; bounds on the rate of convergence of learning processes; controlling the generalization ability of learning processes; constructing…
