Conservative Agency via Attainable Utility Preservation

@article{Turner2020ConservativeAV,
  title={Conservative Agency via Attainable Utility Preservation},
  author={Alexander Matt Turner and Dylan Hadfield-Menell and Prasad Tadepalli},
  journal={Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society},
  year={2020}
}
Reward functions are easy to misspecify; although designers can make corrections after observing mistakes, an agent pursuing a misspecified reward function can irreversibly change the state of its environment. [...] To mitigate this risk, we introduce an approach that balances optimization of the primary reward function with preservation of the ability to optimize auxiliary reward functions. Surprisingly, even when the auxiliary reward functions are randomly generated and therefore uninformative …
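
To make the idea concrete, here is a minimal sketch of an AUP-style reward, assuming tabular Q-functions for the auxiliary reward functions and a designated no-op action; the names (aup_reward, q_aux, noop, lam) are illustrative and the paper's exact penalty normalization may differ.

def aup_reward(r_primary, q_aux, state, action, noop, lam=0.1):
    # Penalize the absolute change, relative to doing nothing, in the agent's
    # ability to optimize each auxiliary reward function (its attainable utility).
    penalty = sum(abs(q[state][action] - q[state][noop]) for q in q_aux)
    # Average over the auxiliary Q-functions and trade off against the primary reward.
    return r_primary - lam * penalty / len(q_aux)

The lam coefficient controls how conservative the agent is: lam = 0 recovers plain reward maximization, while a large lam makes the agent prefer actions that leave its attainable auxiliary values unchanged.
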
Avoiding Negative Side-Effects and Promoting Safe Exploration with Imaginative Planning
With the recent proliferation of the usage of reinforcement learning (RL) agents for solving real-world tasks, safety emerges as a necessary ingredient for their successful application. In this …
Penalizing Side Effects using Stepwise Relative Reachability
Introduces a new variant of the stepwise inaction baseline and a new deviation measure based on the relative reachability of states, which together avoid undesirable incentives that simpler baselines and the unreachability measure fail to avoid.
Avoiding Side Effects By Considering Future Tasks
This work formally defines interference incentives and shows that the future task approach with a baseline policy avoids these incentives in the deterministic case and is more effective for avoiding side effects than the common approach of penalizing irreversible actions.
Challenges for Using Impact Regularizers to Avoid Negative Side Effects
Examines the main current challenges of impact regularizers, relates them to fundamental design decisions, and explores promising directions for overcoming the unsolved challenges in preventing negative side effects with impact regularizers.
Optimal Farsighted Agents Tend to Seek Power
This work formalizes a notion of power within the context of finite Markov decision processes (MDPs) and suggests that farsighted optimal policies tend to seek power over the environment.
Reward Tampering Problems and Solutions in Reinforcement Learning: A Causal Influence Diagram Perspective
This paper uses an intuitive yet precise graphical model called causal influence diagrams to formalize reward tampering problems, and describes a number of modifications to the reinforcement learning objective that prevent incentives for reward tampering.
Be Considerate: Objectives, Side Effects, and Deciding How to Act
Contends that to learn to act safely, a reinforcement learning (RL) agent should contemplate the impact of its actions on the wellbeing and agency of others in the environment, including other acting agents and reactive processes, and provides different criteria for characterizing that impact.
Towards AGI Agent Safety by Iteratively Improving the Utility Function
An AGI safety layer is presented that creates a special dedicated input terminal to support the iterative improvement of an AGI agent's utility function.
Safety Aware Reinforcement Learning (SARL)
Proposes Safety Aware Reinforcement Learning (SARL), a framework in which a virtual safe agent modulates the actions of a main reward-based agent to minimize side effects, and shows that this solution matches the performance of approaches that rely on task-specific side-effect penalties on both the primary and safety objectives.
Benefits of Assistance over Reward Learning
Much recent work has focused on how an agent can learn what to do from human feedback, leading to two major paradigms. The first paradigm is reward learning, in which the agent learns a reward model …

References

Showing 1-10 of 69 references
Inverse Reward Design
Introduces inverse reward design (IRD) as the problem of inferring the true objective from the designed reward and the training MDP, presents approximate methods for solving IRD problems, and uses the resulting solutions to plan risk-averse behavior in test MDPs.
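
For readers unfamiliar with IRD, the observation model that the method inverts can be written roughly as follows (a paraphrase from memory rather than a quotation from the paper; \beta is a rationality coefficient, \phi gives trajectory feature counts, \tilde{M} is the training MDP, \tilde{w} the designed reward, and w^{*} the true reward):

P(\tilde{w} \mid w^{*}, \tilde{M}) \;\propto\; \exp\Big(\beta \, \mathbb{E}_{\xi \sim \pi(\cdot \mid \tilde{w}, \tilde{M})}\big[\, w^{*\top} \phi(\xi) \,\big]\Big),
\qquad
P(w^{*} \mid \tilde{w}, \tilde{M}) \;\propto\; P(\tilde{w} \mid w^{*}, \tilde{M}) \, P(w^{*}).
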
AI Safety Gridworlds
We present a suite of reinforcement learning environments illustrating various safety properties of intelligent agents. These problems include safe interruptibility, avoiding side effects, absent …
Robust Policy Computation in Reward-Uncertain MDPs Using Nondominated Policies
Develops new techniques for the robust optimization of IR-MDPs under the minimax regret decision criterion, exploiting the set of nondominated policies, i.e., policies that are optimal for some instantiation of the imprecise reward function.
Penalizing Side Effects using Stepwise Relative Reachability
Introduces a new variant of the stepwise inaction baseline and a new deviation measure based on the relative reachability of states, which together avoid undesirable incentives that simpler baselines and the unreachability measure fail to avoid.
Measuring and avoiding side effects using relative reachability
Introduces a general definition of side effects, based on the relative reachability of states compared to a default state, that avoids undesirable incentives in tasks that require irreversible actions and in environments that contain sources of change other than the agent.
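
A rough formalization of the deviation measure described in the two reachability-based entries above (my paraphrase, not a quotation from either paper): writing R(x, y) for the reachability of state y from state x, s_t for the state the agent actually reached, and s'_t for the baseline state, the penalty only counts states whose reachability has decreased relative to the baseline:

d_{\mathrm{RR}}(s_t, s'_t) \;=\; \frac{1}{|\mathcal{S}|} \sum_{s \in \mathcal{S}} \max\big( R(s'_t, s) - R(s_t, s),\; 0 \big).
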
Quantilizers: A Safer Alternative to Maximizers for Limited Optimization
Jessica Taylor. AAAI Workshop: AI, Ethics, and Society, 2016.
Describes expected utility quantilization, an alternative to expected utility maximization for powerful AI systems, which could allow the construction of AI systems that do not fall into strange and unanticipated shortcuts and edge cases in pursuit of their goals.
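
A minimal sketch of the idea, assuming a finite action set, a utility estimate, and a base distribution given as weights; quantilize, utility, and base_weights are illustrative names, not from the paper.

import random

def quantilize(actions, utility, q=0.1, base_weights=None):
    # Rank actions by estimated utility, keep only the top q of the base
    # distribution's probability mass, then sample within that set in
    # proportion to the base weights (rather than taking the argmax).
    if base_weights is None:
        base_weights = [1.0 / len(actions)] * len(actions)  # uniform base
    ranked = sorted(zip(actions, base_weights), key=lambda p: utility(p[0]), reverse=True)
    top, mass = [], 0.0
    for action, weight in ranked:
        top.append((action, weight))
        mass += weight
        if mass >= q:
            break
    acts, weights = zip(*top)
    return random.choices(acts, weights=weights, k=1)[0]

With q = 1 this reduces to sampling from the base distribution, and as q shrinks toward 0 it approaches argmax behavior; that dial is how the paper trades optimization power against safety.
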
Reinforcement Learning with a Corrupted Reward Channel
Formalises the problem of corrupted reward signals as a generalised Markov Decision Problem called a Corrupt Reward MDP, and finds that by using randomisation to blunt the agent's optimisation, reward corruption can be partially managed under some assumptions.
Incorrigibility in the CIRL Framework
Argues that it is important to consider systems that follow shutdown commands under a weaker set of assumptions (e.g., that one small verified module is correctly implemented, as opposed to an entire prior probability distribution and/or parameterized reward function) within a value learning framework.
Minimax-Regret Querying on Side Effects for Safe Optimality in Factored Markov Decision Processes
Develops a planning algorithm that avoids potentially negative side effects given what the agent knows about (un)changeable features, and formulates a provably minimax-regret querying strategy for the agent to selectively ask the user about features it has not explicitly been told about.
Cooperative Inverse Reinforcement Learning
Shows that computing optimal joint policies in CIRL games can be reduced to solving a POMDP, proves that optimality in isolation is suboptimal in CIRL, and derives an approximate CIRL algorithm.