Reward Tampering Problems and Solutions in Reinforcement Learning: A Causal Influence Diagram Perspective

@article{Everitt2021RewardTP,
  title={Reward Tampering Problems and Solutions in Reinforcement Learning: A Causal Influence Diagram Perspective},
  author={Tom Everitt and Marcus Hutter},
  journal={Synthese},
  year={2021},
  volume={198},
  pages={6435-6467}
}
Can an arbitrarily intelligent reinforcement learning agent be kept under control by a human user? Or do agents with sufficient intelligence inevitably find ways to shortcut their reward signal? This question impacts how far reinforcement learning can be scaled, and whether alternative paradigms must be developed in order to build safe artificial general intelligence. In this paper, we use an intuitive yet precise graphical model called causal influence diagrams to formalize reward tampering… 
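To make the abstract's formalism concrete, here is a minimal sketch of a causal influence diagram for reward-function tampering as a directed graph, with a simple check for paths from the agent's decision to the reward that pass through the reward-function parameter. The node names and the specific structure are illustrative assumptions (using networkx), not the paper's exact diagrams.

```python
# A minimal sketch of a causal influence diagram (CID) for reward-function
# tampering. Node names and the structure are illustrative assumptions.
import networkx as nx

cid = nx.DiGraph()
# S: state, ThetaR: reward-function parameter, A: agent decision, R: reward (utility)
cid.add_edges_from([
    ("S1", "A1"),            # the agent observes the current state
    ("S1", "S2"),            # environment dynamics
    ("A1", "S2"),            # the action influences the next state (intended path)
    ("ThetaR1", "ThetaR2"),  # the reward function persists over time...
    ("A1", "ThetaR2"),       # ...but the action can also modify it (tampering path)
    ("S2", "R2"),            # reward depends on the next state
    ("ThetaR2", "R2"),       # ...and on the (possibly modified) reward function
])

decision, utility = "A1", "R2"

# Any directed path from the decision to the utility node that passes through
# the reward-function parameter is a potential reward-tampering path.
all_paths = list(nx.all_simple_paths(cid, decision, utility))
tamper_paths = [p for p in all_paths if "ThetaR2" in p]
print("paths from decision to reward:", all_paths)
print("tampering paths:", tamper_paths)
```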
Understanding Agent Incentives using Causal Influence Diagrams. Part I: Single Action Settings
TLDR
Modeling the agent-environment interaction in graphical models called influence diagrams makes it possible to answer two fundamental questions about an agent's incentives directly from the graph, which helps identify algorithms with problematic incentives and design algorithms with better ones.
On Agent Incentives to Manipulate Human Feedback in Multi-Agent Reward Learning Scenarios
TLDR
This work proposes a possible solution to manipulation of human feedback in this setting: the Shared Value Prior (SVP), which equips agents with an assumption that the reward functions of all humans are similar.
Learning to Incentivize Other Learning Agents
TLDR
This work proposes to equip each RL agent in a multi-agent environment with the ability to give rewards directly to other agents, using a learned incentive function, and demonstrates in experiments that such agents significantly outperform standard RL and opponent-shaping agents in challenging general-sum Markov games.
Pitfalls of learning a reward function online
TLDR
This work considers a continual ("one life") learning approach in which the agent learns the reward function and optimises for it at the same time, and formally introduces two desirable properties: 'unriggability', which prevents the agent from steering the learning process towards reward functions that are easier to optimise, and 'uninfluenceability', which requires the learned reward function to be determined by the environment rather than by the agent's behaviour.
Avoiding Tampering Incentives in Deep RL via Decoupled Approval
TLDR
This work presents a principled solution to the problem of learning from influenceable feedback, which combines approval with a decoupled feedback collection procedure and scales to complex 3D environments where tampering is possible.
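As a rough illustration of the decoupling idea in the entry above, the sketch below collects approval for an action sampled independently of the action that is actually executed. The two-action toy problem, the simulated "human", and the REINFORCE-style update are illustrative assumptions, not the paper's agents or environments.

```python
# A minimal sketch of decoupled feedback collection for approval-based updates.
import numpy as np

rng = np.random.default_rng(0)
n_actions = 2            # 0: do the task, 1: tamper with the feedback channel
logits = np.zeros(n_actions)
lr = 0.1

def policy(logits):
    p = np.exp(logits - logits.max())
    return p / p.sum()

def human_approval(queried_action, feedback_tampered):
    # The human approves the task action; a tampered channel approves everything.
    if feedback_tampered:
        return 1.0
    return 1.0 if queried_action == 0 else -1.0

for step in range(1000):
    p = policy(logits)
    executed = rng.choice(n_actions, p=p)   # action that actually runs
    queried = rng.choice(n_actions, p=p)    # independently sampled action to evaluate
    # Key point: feedback is requested for `queried`, not for `executed`,
    # so the executed action cannot directly select the action being judged.
    tampered = (executed == 1)
    approval = human_approval(queried, tampered)
    # REINFORCE-style update on the queried action, weighted by approval.
    grad = -p
    grad[queried] += 1.0
    logits += lr * approval * grad

print("final policy:", policy(logits))
```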
Path-Specific Objectives for Safer Agent Incentives
We present a general framework for training safe agents whose naive incentives are unsafe. As an example, manipulative or deceptive behaviour can improve rewards but should be avoided. Most…
Deception in Social Learning: A Multi-Agent Reinforcement Learning Perspective
TLDR
This research review introduces the problem statement, defines key concepts, critically evaluates existing evidence, and identifies open problems to be addressed in future research.
Agent Incentives: A Causal Perspective
TLDR
A framework for analysing agent incentives using causal influence diagrams is presented and a new graphical criterion for value of control is proposed, establishing its soundness and completeness and introducing two new concepts for incentive analysis.
Learning Performance Graphs from Demonstrations via Task-Based Evaluations
TLDR
An algorithm is proposed to learn the performance graph directly from user-provided demonstrations, and it is shown that reward functions generated using the learned performance graph produce policies similar to those obtained from manually specified performance graphs.
The Alignment Problem for Bayesian History-Based Reinforcement Learners∗
TLDR
This report categorizes sources of misalignment for reinforcement learning agents, illustrates each type with numerous examples, and describes ways to remove each source of misalignment.
...

References

SHOWING 1-10 OF 130 REFERENCES
Understanding Agent Incentives using Causal Influence Diagrams. Part I: Single Action Settings
TLDR
Modeling the agent-environment interaction in graphical models called influence diagrams makes it possible to answer two fundamental questions about an agent's incentives directly from the graph, which helps identify algorithms with problematic incentives and design algorithms with better ones.
Reinforcement Learning with a Corrupted Reward Channel
TLDR
This work formalises this problem as a generalised Markov Decision Problem called Corrupt Reward MDP, and finds that by using randomisation to blunt the agent's optimisation, reward corruption can be partially managed under some assumptions.
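The randomisation mentioned above can be illustrated with a quantilisation-style sketch: rather than maximising a possibly corrupted reward estimate, the agent samples from the top-q fraction of a base distribution. The values, base distribution, and cutoff below are illustrative assumptions.

```python
# A minimal quantilisation-style sketch: sample from the top-q fraction of
# actions under a base distribution instead of maximising a corrupted estimate.
import numpy as np

rng = np.random.default_rng(0)

def quantilize(estimated_rewards, q=0.1, base_probs=None):
    """Sample an action from the top-q fraction of the base distribution,
    ranked by (possibly corrupted) estimated reward."""
    n = len(estimated_rewards)
    base_probs = np.full(n, 1.0 / n) if base_probs is None else np.asarray(base_probs)
    order = np.argsort(estimated_rewards)[::-1]       # best-looking actions first
    cum = np.cumsum(base_probs[order])
    keep = order[: np.searchsorted(cum, q) + 1]       # smallest prefix with mass >= q
    probs = base_probs[keep] / base_probs[keep].sum()
    return rng.choice(keep, p=probs)

# One action has a hugely inflated (corrupted) reward estimate; a maximiser
# would always pick it, while the quantilizer only picks it part of the time.
estimates = np.array([1.0, 0.9, 0.8, 100.0, 0.7])
picks = [quantilize(estimates, q=0.4) for _ in range(1000)]
print("fraction of corrupted picks:", np.mean(np.array(picks) == 3))
```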
Interactively shaping agents via human reinforcement: the TAMER framework
TLDR
Results from two domains demonstrate that lay users can train TAMER agents without defining an environmental reward function (as in an MDP) and indicate that human training within the TAMER framework can reduce sample complexity over autonomous learning algorithms.
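A minimal tabular sketch of the TAMER idea described above: a model of the human's reinforcement signal is learned by supervised updates and actions are chosen greedily with respect to it. The corridor task and the simulated trainer are purely illustrative assumptions, not the framework's actual domains.

```python
# A minimal tabular sketch of the TAMER idea: learn H_hat(s, a), a model of the
# human's reinforcement signal, and act greedily with respect to it
# (no environmental reward, no discounting of future feedback).
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 5, 2          # a 5-cell corridor; actions: 0 = left, 1 = right
H_hat = np.zeros((n_states, n_actions))
alpha, epsilon = 0.2, 0.1

def simulated_human(state, action):
    # The trainer rewards steps toward the rightmost cell.
    return 1.0 if action == 1 else -1.0

state = 0
for step in range(500):
    if rng.random() < epsilon:                  # small amount of exploration
        action = int(rng.integers(n_actions))
    else:
        action = int(np.argmax(H_hat[state]))   # act greedily w.r.t. the learned model
    feedback = simulated_human(state, action)
    # Supervised update of the human-reinforcement model toward the feedback.
    H_hat[state, action] += alpha * (feedback - H_hat[state, action])
    state = min(state + 1, n_states - 1) if action == 1 else max(state - 1, 0)

print(np.round(H_hat, 2))
```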
Inverse Reward Design
TLDR
This work introduces inverse reward design (IRD) as the problem of inferring the true objective based on the designed reward and the training MDP, and introduces approximate methods for solving IRD problems, and uses their solution to plan risk-averse behavior in test MDPs.
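A toy sketch of the IRD observation model, under strong simplifying assumptions: a handful of feature vectors stands in for the training MDP, the hypothesis space over true rewards is a small discrete grid, and the normaliser over alternative proxies is ignored. The posterior over true reward weights is taken to be proportional to how well the proxy-optimal behaviour scores under each candidate.

```python
# A minimal sketch of the inverse reward design observation model: the designer
# is modelled as approximately rational, so P(proxy_w | true_w) is taken
# proportional to exp(beta * value achieved in the training MDP by optimising
# proxy_w, evaluated under true_w). Features, candidates, and the training
# "MDP" are illustrative assumptions; the proxy-space normaliser is omitted.
import itertools
import numpy as np

# Feature expectations of a few policies available in the training environment.
training_features = np.array([
    [1.0, 0.0, 0.0],   # e.g. mostly grass
    [0.0, 1.0, 0.0],   # e.g. mostly dirt
    [0.5, 0.5, 0.0],   # a mixture; the third feature (e.g. lava) never appears in training
])

def best_features(weights):
    """Feature expectations of the policy that maximises `weights` in training."""
    return training_features[np.argmax(training_features @ weights)]

# Discrete hypothesis space over the true reward weights.
candidates = np.array(list(itertools.product([-1.0, 0.0, 1.0], repeat=3)))
proxy = np.array([0.0, 1.0, 0.0])   # the designed (proxy) reward weights
beta = 5.0

# Likelihood of the observed proxy under each candidate true reward.
likelihood = np.exp(beta * (candidates @ best_features(proxy)))
posterior = likelihood / likelihood.sum()

# The posterior only constrains weights on features the proxy-optimal policy
# actually collects -- it stays agnostic about the never-seen third feature,
# which is why risk-averse planning should avoid it at test time.
for w, p in sorted(zip(candidates.tolist(), posterior), key=lambda t: -t[1])[:5]:
    print(w, round(float(p), 3))
```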
Reward-rational (implicit) choice: A unifying formalism for reward learning
TLDR
This work provides two examples of how different types of behavior can be interpreted in a single unifying formalism - as a reward-rational choice that the human is making, often implicitly, in the search for a reward.
Cooperative Inverse Reinforcement Learning
TLDR
It is shown that computing optimal joint policies in CIRL games can be reduced to solving a POMDP, it is proved that optimality in isolation is suboptimal in CIRL, and an approximate CIRL algorithm is derived.
Penalizing Side Effects using Stepwise Relative Reachability
TLDR
A new variant of the stepwise inaction baseline and a new deviation measure based on relative reachability of states are introduced that avoids the given undesirable incentives, while simpler baselines and the unreachability measure fail.
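A small sketch of a relative-reachability penalty on a toy deterministic MDP: reachability is computed by fixed-point iteration, and the penalty is the average reduction in reachability relative to a baseline state. The transition table, discount, and simplified baseline are illustrative assumptions rather than the paper's stepwise inaction baseline.

```python
# A minimal sketch of a relative-reachability penalty on a small deterministic MDP.
import numpy as np

# next_state[s, a]: deterministic transitions for 4 states, 2 actions.
# Action 1 from state 0 is irreversible: state 3 is absorbing.
next_state = np.array([
    [1, 3],
    [0, 2],
    [1, 2],
    [3, 3],
])
n_states, n_actions = next_state.shape
gamma = 0.95

def reachability(next_state, gamma, iters=200):
    """R[x, y] ~= max over policies of gamma^(steps from x to y)."""
    R = np.eye(n_states)
    for _ in range(iters):
        best_next = np.max(R[next_state], axis=1)    # best reachability after one step
        R = np.maximum(np.eye(n_states), gamma * best_next)
    return R

R = reachability(next_state, gamma)

def rr_penalty(current, baseline):
    """Average reduction in reachability relative to the baseline state."""
    return np.mean(np.maximum(R[baseline] - R[current], 0.0))

# Compare the two actions from state 0 against a baseline state in which
# nothing happened (the agent is assumed to have stayed at state 0).
print("penalty for reversible action :", rr_penalty(current=1, baseline=0))
print("penalty for irreversible action:", rr_penalty(current=3, baseline=0))
```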
Scalable agent alignment via reward modeling: a research direction
TLDR
This work outlines a high-level research direction to solve the agent alignment problem centered around reward modeling: learning a reward function from interaction with the user and optimizing the learned reward function with reinforcement learning.
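The reward-modelling step described above can be sketched as fitting a reward model to pairwise preferences with a Bradley-Terry (logistic) loss. The synthetic preferences and linear model below are illustrative assumptions, and the subsequent step of optimising the learned reward with reinforcement learning is omitted.

```python
# A minimal sketch of the reward-modelling step: fit a linear reward model to
# pairwise preferences over trajectory features via a Bradley-Terry loss.
import numpy as np

rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0, 0.5])          # hidden "user" reward weights
n_pairs, dim = 500, 3

# Synthetic preference data: the user prefers the trajectory with higher true reward.
feats_a = rng.normal(size=(n_pairs, dim))
feats_b = rng.normal(size=(n_pairs, dim))
prefer_a = ((feats_a - feats_b) @ true_w > 0).astype(float)

w = np.zeros(dim)                            # learned reward weights
lr = 0.5
for _ in range(2000):
    margin = (feats_a - feats_b) @ w         # r(A) - r(B) under the current model
    p_a = 1.0 / (1.0 + np.exp(-margin))      # Bradley-Terry: P(A preferred over B)
    grad = ((p_a - prefer_a)[:, None] * (feats_a - feats_b)).mean(axis=0)
    w -= lr * grad                           # gradient step on the logistic loss

print("learned weights (scaled):", np.round(w / np.abs(w).max(), 2))
print("true weights (scaled):   ", np.round(true_w / np.abs(true_w).max(), 2))
```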
'Indifference' methods for managing agent rewards
TLDR
This paper classifies and analyses these methods on POMDPs, and establishes their uses, strengths, and limitations to make the tools of indifference generally accessible and usable to agent designers.
How RL Agents Behave When Their Actions Are Modified
TLDR
This work presents the Modified-Action Markov Decision Process, an extension of the MDP model that allows actions to differ from the policy, and analyzes the asymptotic behaviours of common reinforcement learning algorithms in this setting to show that they adapt in different ways.
...