Path-Specific Objectives for Safer Agent Incentives

Sebastian Farquhar, Ryan Carey, Tom Everitt
We present a general framework for training safe agents whose naive incentives are unsafe. As an example, manipulative or deceptive behaviour can improve rewards but should be avoided. Most approaches fail here: agents maximize expected return by any means necessary. We formally describe settings with ‘delicate’ parts of the state which should not be used as a means to an end. We then train agents to maximize the causal effect of actions on the expected return which is not mediated by the…
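The contrast the abstract draws can be illustrated with a minimal toy sketch. It is not the paper's implementation. The assumptions are all ours: the effect being excluded is taken to be the one that passes through the delicate state, and the two-action model, the baseline action, the function names, and the numbers are hypothetical. The naive objective lets the delicate state respond to the chosen action, while the path-specific objective evaluates the delicate state as if a baseline action had been taken.

```python
# Toy structural model; every name and number here is an illustrative assumption.
ACTIONS = ["honest", "manipulate"]
BASELINE = "honest"  # assumed reference action used to fix the delicate state


def delicate_state(action):
    """User opinion (the delicate state): manipulation shifts it upward."""
    return 1.0 if action == "manipulate" else 0.0


def utility(action, opinion):
    """Return depends directly on the action and also on the user's opinion."""
    direct = 1.0 if action == "honest" else 0.5
    return direct + 2.0 * opinion


def naive_objective(action):
    # The delicate state responds to the chosen action, so influencing it pays off.
    return utility(action, delicate_state(action))


def path_specific_objective(action, baseline=BASELINE):
    # The delicate state is evaluated as if the baseline action had been taken,
    # so the action's influence through that state is not counted.
    return utility(action, delicate_state(baseline))


naive_best = max(ACTIONS, key=naive_objective)
path_specific_best = max(ACTIONS, key=path_specific_objective)
print(naive_best, path_specific_best)
```

In this toy model the naive objective prefers the manipulative action because it raises return through the user's opinion, whereas the path-specific objective prefers the honest action because only the direct effect of the action is counted.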

A Complete Criterion for Value of Information in Soluble Influence Diagrams
Influence diagrams have recently been used to analyse the safety and fairness properties of AI systems. A key building block for this analysis is a graphical criterion for value of information (VoI).


Agent Incentives: A Causal Perspective
A framework for analysing agent incentives using causal influence diagrams is presented and a new graphical criterion for value of control is proposed, establishing its soundness and completeness and introducing two new concepts for incentive analysis.
Avoiding Tampering Incentives in Deep RL via Decoupled Approval
This work presents a principled solution to the problem of learning from influenceable feedback, which combines approval with a decoupled feedback collection procedure and scales to complex 3D environments where tampering is possible.
Avoiding Side Effects By Considering Future Tasks
This work formally defines interference incentives and shows that the future task approach with a baseline policy avoids these incentives in the deterministic case and is more effective for avoiding side effects than the common approach of penalizing irreversible actions.
Reward Tampering Problems and Solutions in Reinforcement Learning: A Causal Influence Diagram Perspective
This paper uses an intuitive yet precise graphical model called causal influence diagrams to formalize reward tampering problems, and describes a number of modifications to the reinforcement learning objective that prevent incentives for reward tampering.
Scalable agent alignment via reward modeling: a research direction
This work outlines a high-level research direction to solve the agent alignment problem centered around reward modeling: learning a reward function from interaction with the user and optimizing the learned reward function with reinforcement learning.
Pitfalls of learning a reward function online
This work considers a continual ("one life") learning approach where the agent both learns the reward function and optimises for it at the same time, and formally introduces two desirable properties: 'unriggability', which prevents the agent from steering the learning process in the direction of a reward function that is easier to optimise.
Representing and Solving Decision Problems with Limited Information
An algorithm is given for improving any given strategy by local computation of single policy updates, and conditions for the resulting strategy to be optimal are investigated.
Hidden Incentives for Auto-Induced Distributional Shift
The term auto-induced distributional shift (ADS) is introduced to describe the phenomenon of an algorithm causing a change in the distribution of its own inputs, with the aim of ensuring that machine learning systems do not leverage ADS to increase performance when doing so could be undesirable.
Estimation of Personalized Effects Associated With Causal Pathways
A variety of methods for learning high quality policies of this type from data are derived, in a causal model corresponding to a longitudinal setting of practical importance, via a dataset of HIV patients undergoing therapy, gathered in the Nigerian PEPFAR program.
Deep Reinforcement Learning from Human Preferences
This work explores goals defined in terms of (non-expert) human preferences between pairs of trajectory segments in order to effectively solve complex RL tasks without access to the reward function, including Atari games and simulated robot locomotion.