# Inverse Reward Design

@inproceedings{HadfieldMenell2017InverseRD, title={Inverse Reward Design}, author={Dylan Hadfield-Menell and Smitha Milli and P. Abbeel and Stuart J. Russell and Anca D. Dragan}, booktitle={NIPS}, year={2017} }

Autonomous agents optimize the reward function we give them. [...] We introduce inverse reward design (IRD) as the problem of inferring the true objective based on the designed reward and the training MDP. We introduce approximate methods for solving IRD problems, and use their solution to plan risk-averse behavior in test MDPs. Empirical results suggest that this approach can help alleviate negative side effects of misspecified reward functions and mitigate reward hacking.
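The inference and planning steps described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: it assumes linear rewards over three hypothetical features (target, dirt, lava), small finite sets of candidate true rewards and proxy rewards, precomputed feature expectations for each proxy in the training MDP, and a simple max-min rule standing in for risk-averse planning. All names and numbers below are illustrative assumptions.

```python
import numpy as np

def ird_posterior(true_ws, proxy_feats, observed_idx, beta=1.0, prior=None):
    """Posterior over candidate true weights given the proxy the designer chose.

    true_ws:      (K, d) candidate true reward weights (assumed finite grid)
    proxy_feats:  (J, d) expected training-MDP feature counts of the policy
                  that is optimal for each candidate proxy (assumed precomputed)
    observed_idx: index of the proxy the designer actually picked
    """
    true_ws = np.asarray(true_ws, dtype=float)
    proxy_feats = np.asarray(proxy_feats, dtype=float)
    k = true_ws.shape[0]
    prior = np.full(k, 1.0 / k) if prior is None else np.asarray(prior, dtype=float)
    # scores[k, j]: true reward under candidate k earned by optimizing proxy j
    scores = beta * true_ws @ proxy_feats.T
    # Likelihood of each proxy under each candidate: softmax over proxies
    scores -= scores.max(axis=1, keepdims=True)
    log_lik = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    log_post = np.log(prior) + log_lik[:, observed_idx]
    post = np.exp(log_post - log_post.max())
    return post / post.sum()

def risk_averse_choice(traj_feats, true_ws, post, min_prob=1e-3):
    """Pick the trajectory with the best worst-case reward over plausible weights."""
    traj_feats = np.asarray(traj_feats, dtype=float)
    plausible = np.asarray(true_ws, dtype=float)[post > min_prob]
    rewards = traj_feats @ plausible.T          # (num_trajectories, num_plausible)
    return int(np.argmax(rewards.min(axis=1)))

# Features: [target, dirt, lava]. The training MDP contains no lava,
# so every proxy's lava feature expectation is zero.
true_ws = [[1, -1, 0], [1, -1, -10], [1, 1, 0], [1, 1, -10]]
proxy_feats = [[1, 2, 0],   # proxy 0: shortcut through dirt
               [1, 0, 0]]   # proxy 1: avoids dirt (the one the designer chose)
post = ird_posterior(true_ws, proxy_feats, observed_idx=1)

# Test MDP: trajectory 0 crosses lava, trajectory 1 detours over dirt.
test_trajs = [[1, 0, 1], [1, 1, 0]]
choice = risk_averse_choice(test_trajs, true_ws, post)
print(post, choice)
```

In this toy setup the chosen proxy makes the "dirt is bad" candidates more probable, but it cannot separate candidates that differ only in the lava weight, because lava never appears in training. The max-min rule therefore avoids the lava trajectory in the test MDP.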

## 191 Citations

Assisted Robust Reward Design

- Computer Science
- ArXiv
- 2021

An Assisted Reward Design method is contributed that speeds up the design process by anticipating and influencing this future evidence: rather than letting the designer eventually encounter failure cases and revise the reward, the method actively exposes the designer to such environments during the development phase.

Active Inverse Reward Design

- Mathematics, Computer Science
- ArXiv
- 2018

This work uses inverse reward design (IRD) (Hadfield-Menell et al., 2017) to update the distribution over the true reward function from the observed proxy reward function chosen by the designer, and finds that this approach not only decreases the uncertainty about the true reward, but also greatly improves performance in unseen environments while only querying for reward functions in a single training environment.

Choice Set Misspecification in Reward Inference

- Computer Science
- AISafety@IJCAI
- 2020

This work introduces the idea that the choice set itself might be difficult to specify, and analyzes choice set misspecification: what happens as the robot makes incorrect assumptions about the set of choices from which the human selects their feedback.

Simplifying Reward Design through Divide-and-Conquer

- Computer Science, Engineering
- Robotics: Science and Systems
- 2018

It is found that independent reward design outperforms the standard, joint, reward design process but works best when the design problem can be divided into simpler subproblems.

INFERRING REWARD FUNCTIONS

- 2018

Our goal is to infer reward functions from demonstrations. In order to infer the correct reward function, we must account for the systematic ways in which the demonstrator is suboptimal. Prior work…

Programmatic Reward Design by Example

- Computer Science
- ArXiv
- 2021

A probabilistic framework that can infer the best candidate programmatic reward function from expert demonstrations, and enable RL agents to achieve state-of-the-art performance on highly complex tasks.

Reward Machines: Exploiting Reward Function Structure in Reinforcement Learning

- Computer Science
- Journal of Artificial Intelligence Research
- 2022

This paper proposes reward machines, a type of finite state machine that supports the specification of reward functions while exposing reward function structure, and describes different methodologies to exploit this structure to support learning, including automated reward shaping, task decomposition, and counterfactual reasoning with off-policy learning.

Admissible Policy Teaching through Reward Design

- Computer Science
- ArXiv
- 2022

This paper shows that the reward design problem for admissible policy teaching is computationally challenging: it is NP-hard to find an approximately optimal reward modification. It formulates a surrogate problem whose optimal solution approximates the optimal solution to the reward design problem in this setting, but is more amenable to optimization techniques and analysis.

On the Feasibility of Learning, Rather than Assuming, Human Biases for Reward Inference

- Computer Science, Mathematics
- ICML
- 2019

Mixed findings suggest that at least for the foreseeable future, agents need a middle ground between the flexibility of data-driven methods and the useful bias of known human biases.

Pitfalls of learning a reward function online

- Computer Science
- ArXiv
- 2020

This work considers a continual ("one life") learning approach where the agent both learns the reward function and optimises for it at the same time, and formally introduces two desirable properties, among them "unriggability", which prevents the agent from steering the learning process in the direction of a reward function that is easier to optimise.

## References


Reward Design via Online Gradient Ascent

- Computer Science
- NIPS
- 2010

This work develops a gradient ascent approach with formal convergence guarantees for approximately solving the optimal reward problem online during an agent's lifetime and demonstrates its ability to improve reward functions in agents with various forms of limitations.

Cooperative Inverse Reinforcement Learning

- Computer Science
- NIPS
- 2016

It is shown that computing optimal joint policies in CIRL games can be reduced to solving a POMDP, it is proved that optimality in isolation is suboptimal in CIRL, and an approximate CIRL algorithm is derived.

A Game-Theoretic Approach to Apprenticeship Learning

- Computer Science
- NIPS
- 2007

A new algorithm is given that is computationally faster, is easier to implement, and can be applied even in the absence of an expert, and it is shown that this algorithm may produce a policy that is substantially better than the expert's.

Where Do Rewards Come From

- Psychology, Computer Science
- 2009

A general computational framework for reward is advanced that places it in an evolutionary context, formulating a notion of an optimal reward function given a fitness function and some distribution of environments.

Learning preferences for manipulation tasks from online coactive feedback

- Computer Science
- Int. J. Robotics Res.
- 2015

This work proposes a coactive online learning framework for teaching preferences in contextually rich environments, and implements its algorithm on two high-degree-of-freedom robots, PR2 and Baxter, and presents three intuitive mechanisms for providing incremental feedback.

Learning the Preferences of Ignorant, Inconsistent Agents

- Computer Science
- AAAI
- 2016

A behavioral experiment in which human subjects perform preference inference given the same observations of choices as the model is presented, showing that human subjects explain choices in terms of systematic deviations from optimal behavior and suggesting that they take such deviations into account when inferring preferences.

Shared Autonomy via Hindsight Optimization

- Computer Science, Medicine
- Robotics: Science and Systems
- 2015

The problem of shared autonomy is formulated as a Partially Observable Markov Decision Process with uncertainty over the user's goal, and maximum entropy inverse optimal control is utilized to estimate a distribution over the user's goal based on the history of inputs.

Maximum Entropy Inverse Reinforcement Learning

- Computer Science
- AAAI
- 2008

A probabilistic approach based on the principle of maximum entropy is developed; it provides a well-defined, globally normalized distribution over decision sequences while offering the same performance guarantees as existing methods.

The Off-Switch Game

- Computer Science
- IJCAI
- 2017

It is concluded that giving machines an appropriate level of uncertainty about their objectives leads to safer designs, and it is argued that this setting is a useful generalization of the classical AI paradigm of rational agents.

RL$^2$: Fast Reinforcement Learning via Slow Reinforcement Learning

- Computer Science, Mathematics
- ArXiv
- 2016

This paper proposes to represent a "fast" reinforcement learning algorithm as a recurrent neural network (RNN) and learn it from data, encoded in the weights of the RNN, which are learned slowly through a general-purpose ("slow") RL algorithm.