Reward Tampering Problems and Solutions in Reinforcement Learning: A Causal Influence Diagram Perspective

@article{Everitt2021RewardTP,
  title={Reward Tampering Problems and Solutions in Reinforcement Learning: A Causal Influence Diagram Perspective},
  author={Tom Everitt and Marcus Hutter},
  journal={Synthese},
  year={2021},
  volume={198},
  pages={6435-6467}
}
Can an arbitrarily intelligent reinforcement learning agent be kept under control by a human user? Or do agents with sufficient intelligence inevitably find ways to shortcut their reward signal? This question impacts how far reinforcement learning can be scaled, and whether alternative paradigms must be developed in order to build safe artificial general intelligence. In this paper, we use an intuitive yet precise graphical model called causal influence diagrams to formalize reward tampering… 
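As a toy illustration of the kind of graphical model the abstract refers to, the sketch below (plain Python; the node names, edges, and two-step structure are our own illustrative choices, not the paper's diagrams) encodes an interaction in which the agent's action can modify the parameters of the reward function, so that tampering shows up as a directed path from the action to a utility node via those parameters.

```python
# Illustrative causal influence diagram for a two-step interaction in which
# the action A1 can modify the reward-function parameters Theta2.
# Directed edges: parent -> list of children.
cid_edges = {
    "S1":     ["A1", "S2", "Theta2"],   # initial state
    "A1":     ["S2", "Theta2"],         # decision node (agent's action)
    "Theta1": ["Theta2", "R1"],         # reward-function parameters at t=1
    "Theta2": ["R2"],                   # reward-function parameters at t=2
    "S2":     ["R2"],
    "R1":     [],                       # utility node (reward at t=1)
    "R2":     [],                       # utility node (reward at t=2)
}
node_type = {"A1": "decision", "R1": "utility", "R2": "utility"}  # rest: chance

def descendants(graph, node):
    """All nodes reachable from `node` along directed paths."""
    seen, stack = set(), list(graph.get(node, []))
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.add(n)
            stack.extend(graph.get(n, []))
    return seen

# The tampering problem is visible in the structure: the action can influence
# the reward parameters, and those parameters determine a utility node.
print("A1 can affect Theta2:", "Theta2" in descendants(cid_edges, "A1"))
print("Theta2 affects a utility node:",
      any(node_type.get(n) == "utility" for n in descendants(cid_edges, "Theta2")))
```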

Understanding Agent Incentives using Causal Influence Diagrams. Part I: Single Action Settings

TLDR
Modeling the agent-environment interaction in graphical models called influence diagrams makes it possible to answer two fundamental questions about an agent's incentives directly from the graph, helping to identify algorithms with problematic incentives and to design algorithms with better incentives.
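As a loose sketch of the kind of check such graphical criteria enable (a deliberate simplification, not the paper's exact definitions), the helper below flags chance nodes that lie on a directed path from the decision to a utility node, i.e. nodes the agent both can influence and has a reason to influence.

```python
def directed_reach(graph, start):
    """Nodes reachable from `start` along directed edges."""
    seen, stack = set(), list(graph.get(start, []))
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.add(n)
            stack.extend(graph.get(n, []))
    return seen

def control_incentive_candidates(graph, node_type, decision):
    """Chance nodes on a directed path decision -> node -> some utility node."""
    out = []
    for x in directed_reach(graph, decision):
        if node_type.get(x, "chance") != "chance":
            continue
        if any(node_type.get(u) == "utility" for u in directed_reach(graph, x)):
            out.append(x)
    return sorted(out)

# With the illustrative diagram from the sketch above:
# control_incentive_candidates(cid_edges, node_type, "A1") -> ['S2', 'Theta2'],
# i.e. the agent has a reason to influence both the next state and the
# reward-function parameters; the latter is the tampering incentive.
```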

On Agent Incentives to Manipulate Human Feedback in Multi-Agent Reward Learning Scenarios

TLDR
This work proposes a possible solution to manipulation of human feedback in this setting: the Shared Value Prior (SVP), which equips agents with an assumption that the reward functions of all humans are similar.

Learning to Incentivize Other Learning Agents

TLDR
This work proposes to equip each RL agent in a multi-agent environment with the ability to give rewards directly to other agents, using a learned incentive function, and demonstrates in experiments that such agents significantly outperform standard RL and opponent-shaping agents in challenging general-sum Markov games.
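As a toy sketch of the reward flow such a scheme sets up (numpy; the two-agent numbers and the quadratic giving cost are illustrative, not the paper's objective), each agent's effective reward combines its environment reward, the incentives it receives, and a cost for the incentives it gives; the incentive function itself is the learned component.

```python
import numpy as np

env_reward = np.array([1.0, 0.2])           # per-agent environment reward
# incentive[i, j]: reward that agent i's learned incentive function gives to agent j
incentive = np.array([[0.0, 0.5],
                      [0.1, 0.0]])
cost_coeff = 0.1

received   = incentive.sum(axis=0)          # column sums: what each agent receives
given_cost = cost_coeff * (incentive ** 2).sum(axis=1)
effective  = env_reward + received - given_cost
print(effective)
# The incentive-giving parameters are the learned part: they are trained with
# their own objective, based on how the handed-out rewards shape the
# recipients' learning (hence the opponent-shaping comparison above).
```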

Pitfalls of learning a reward function online

TLDR
This work considers a continual ("one life") learning approach in which the agent both learns the reward function and optimises for it at the same time, and formally introduces two desirable properties: 'unriggability', which prevents the agent from steering the learning process in the direction of a reward function that is easier to optimise, and 'uninfluenceability', which requires the learning process to be attributable to facts about the environment rather than to the agent's behaviour.
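A small numerical sketch of the unriggability check (numpy; all probabilities are made up): the learning process is unriggable, roughly, when the reward-function distribution it is expected to produce is the same under every policy.

```python
import numpy as np

prior = np.array([0.5, 0.5])   # initial belief over two candidate reward functions

# For each policy: the observation distribution it induces, and the reward-
# function distribution the learning process outputs after each observation.
process = {
    "ask":   (np.array([0.5, 0.5]), np.array([[0.9, 0.1], [0.1, 0.9]])),
    "avoid": (np.array([1.0, 0.0]), np.array([[0.5, 0.5], [0.5, 0.5]])),
    "nudge": (np.array([1.0, 0.0]), np.array([[0.8, 0.2], [0.2, 0.8]])),
}

def expected_learned(policy):
    p_obs, learned = process[policy]
    return p_obs @ learned      # expected learned reward-function distribution

for pol in process:
    print(pol, expected_learned(pol))
# "ask" and "avoid" both give [0.5 0.5] = prior, so between them the agent
# cannot steer what is learned; "nudge" gives [0.8 0.2], a policy that rigs
# the learning process, which is exactly what unriggability rules out.
```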

Advanced artificial agents intervene in the provision of reward

TLDR
A fully formal idealized agent that makes observations informing it about its goal is shown to be unable to disambiguate the message from the referent, and some recent approaches to avoiding this problem are discussed.

Avoiding Tampering Incentives in Deep RL via Decoupled Approval

TLDR
This work presents a principled solution to the problem of learning from influenceable feedback: it combines approval with a decoupled feedback collection procedure and scales to complex 3D environments where tampering is possible.
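A rough control-flow sketch of the decoupling idea (plain Python; `policy.probs`, `policy.action_space`, `env.step`, `human_approval`, and `learn` are hypothetical placeholders, and this is not the paper's exact algorithm): the action executed in the environment and the action the human is asked to approve are sampled independently, so corruption caused by the executed action is not tied to the state-action pair used for the update.

```python
import random

def decoupled_approval_step(policy, state, env, human_approval, learn):
    actions = policy.action_space(state)
    probs = policy.probs(state)
    a_exec  = random.choices(actions, weights=probs)[0]   # acts in the world
    a_query = random.choices(actions, weights=probs)[0]   # independent draw

    env.step(a_exec)                           # only a_exec affects the environment
    approval = human_approval(state, a_query)  # feedback is about a_query
    learn(state, a_query, approval)            # the update uses the queried pair
```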

Path-Specific Objectives for Safer Agent Incentives

We present a general framework for training safe agents whose naive incentives are unsafe. As an example, manipulative or deceptive behaviour can improve rewards but should be avoided. Most…

Deception in Social Learning: A Multi-Agent Reinforcement Learning Perspective

TLDR
This research review introduces the problem statement, defines key concepts, critically evaluates existing evidence, and identifies open problems that should be addressed in future research.

Agent Incentives: A Causal Perspective

TLDR
A framework for analysing agent incentives using causal influence diagrams is presented; a new graphical criterion for value of control is proposed and shown to be sound and complete, and two new concepts for incentive analysis are introduced.

Learning Performance Graphs from Demonstrations via Task-Based Evaluations

TLDR
An algorithm is proposed to learn the performance graph directly from user-provided demonstrations, and it is shown that reward functions generated using the learned performance graph produce policies similar to those obtained from manually specified performance graphs.

References

SHOWING 1-10 OF 116 REFERENCES

Understanding Agent Incentives using Causal Influence Diagrams. Part I: Single Action Settings

TLDR
Modeling the agent-environment interaction in graphical models called influence diagrams makes it possible to answer two fundamental questions about an agent's incentives directly from the graph, helping to identify algorithms with problematic incentives and to design algorithms with better incentives.

Interactively shaping agents via human reinforcement: the TAMER framework

TLDR
Results from two domains demonstrate that lay users can train TAMER agents without defining an environmental reward function (as in an MDP) and indicate that human training within the TAMER framework can reduce sample complexity over autonomous learning algorithms.

Inverse Reward Design

TLDR
This work introduces inverse reward design (IRD) as the problem of inferring the true objective from the designed reward and the training MDP, presents approximate methods for solving IRD problems, and uses their solutions to plan risk-averse behavior in test MDPs.
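A compact numerical sketch of the inference step (numpy; the feature model, the candidate set, and the simplified designer likelihood, which ignores the normalisation over alternative proxy rewards, are all illustrative): candidate true rewards are weighted by how well the proxy-optimal behaviour in the training MDP would score under them.

```python
import numpy as np

# Candidate true reward weights and the proxy weights the designer specified.
candidates = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
proxy      = np.array([1.0, 0.0])

def train_features(weights):
    """Feature counts obtained by optimising `weights` in the training MDP (toy)."""
    return np.array([1.0, 0.2]) if weights[0] >= weights[1] else np.array([0.2, 1.0])

beta = 5.0
phi_proxy = train_features(proxy)        # behaviour the proxy induces in training
log_lik = np.array([beta * w_true @ phi_proxy for w_true in candidates])
posterior = np.exp(log_lik - log_lik.max())
posterior /= posterior.sum()
print(posterior)   # belief over which candidate is the intended true reward
```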

Reward-rational (implicit) choice: A unifying formalism for reward learning

TLDR
This work provides two examples of how different types of behavior can be interpreted in a single unifying formalism - as a reward-rational choice that the human is making, often implicitly, in the search for a reward.

Cooperative Inverse Reinforcement Learning

TLDR
It is shown that computing optimal joint policies in CIRL games can be reduced to solving a POMDP, it is proved that optimality in isolation is suboptimal in CIRL, and an approximate CIRL algorithm is derived.

Penalizing Side Effects using Stepwise Relative Reachability

TLDR
A new variant of the stepwise inaction baseline and a new deviation measure based on the relative reachability of states are introduced that avoid the identified undesirable incentives, where simpler baselines and the unreachability measure fail.
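A small sketch of the deviation measure (numpy; the 3-state reachability matrix is made up): the penalty is the average loss of reachability of other states relative to the baseline state, truncated at zero so that gains in reachability are not rewarded.

```python
import numpy as np

# reach[i, j]: (discounted) reachability of state j starting from state i.
reach = np.array([
    [1.0, 0.8, 0.6],
    [0.0, 1.0, 0.9],   # from state 1, state 0 can no longer be reached
    [0.0, 0.0, 1.0],
])

def relative_reachability_penalty(current, baseline):
    deficit = np.maximum(0.0, reach[baseline] - reach[current])
    return deficit.mean()

# Comparing the agent's state to a baseline (e.g. the stepwise inaction) state:
print(relative_reachability_penalty(current=1, baseline=0))  # 0.333...
```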

Scalable agent alignment via reward modeling: a research direction

TLDR
This work outlines a high-level research direction to solve the agent alignment problem centered around reward modeling: learning a reward function from interaction with the user and optimizing the learned reward function with reinforcement learning.
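A bare-bones sketch of the central loop of this agenda (numpy; a linear reward model trained on synthetic pairwise preferences with a Bradley-Terry-style loss, standing in for the learned models and RL optimisers the paper envisages).

```python
import numpy as np

rng = np.random.default_rng(0)
w_user = np.array([1.0, -0.5])                 # hidden "user" reward (toy)
segments = rng.normal(size=(200, 2, 2))        # 200 pairs of trajectory features

# User feedback: which segment in each pair is preferred under w_user.
prefs = (segments[:, 0] @ w_user > segments[:, 1] @ w_user).astype(float)

w = np.zeros(2)                                # reward model parameters
for _ in range(500):
    r0, r1 = segments[:, 0] @ w, segments[:, 1] @ w
    p0 = 1.0 / (1.0 + np.exp(r1 - r0))         # P(segment 0 preferred | model)
    grad = ((p0 - prefs)[:, None] * (segments[:, 0] - segments[:, 1])).mean(axis=0)
    w -= 0.5 * grad                            # gradient step on the preference loss
print(w)   # aligns with w_user's direction; an RL agent would then be trained
           # to optimise this learned reward instead of a hand-coded one.
```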

'Indifference' methods for managing agent rewards

TLDR
This paper classifies and analyses indifference methods on POMDPs, and establishes their uses, strengths, and limitations, so as to make the tools of indifference generally accessible and usable to agent designers.

Learning What to Value

I. J. Good's intelligence explosion theory predicts that ultraintelligent agents will undergo a process of repeated self-improvement; in the wake of such an event, how well our values are fulfilled…

How RL Agents Behave When Their Actions Are Modified

TLDR
The Modified-Action Markov Decision Process, an extension of the MDP model in which the executed action may differ from the one selected by the policy, is presented; the asymptotic behaviours of common reinforcement learning algorithms in this setting are analysed, showing that they adapt in different ways.
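A toy sketch of the setting (plain Python; the "override with some probability" modification and the `env.step` interface are illustrative placeholders, not the paper's formal definition): the action executed in the environment may differ from the action the policy selected, for instance because an overseer intervenes.

```python
import random

def mamdp_step(env, policy, state, override_action=None, p_override=0.1):
    chosen = policy(state)                      # action selected by the policy
    executed = chosen
    if override_action is not None and random.random() < p_override:
        executed = override_action              # the action is modified externally
    next_state, reward = env.step(executed)     # hypothetical environment interface
    # Whether a learner conditions its update on `chosen` or on `executed` is
    # one of the design choices behind the different behaviours analysed here.
    return chosen, executed, next_state, reward
```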
...