Corpus ID: 227261694

Optimal Policies Tend To Seek Power

Alexander Matt Turner, Logan Smith, Rohin Shah, Andrew Critch, Prasad Tadepalli
Some researchers have speculated that capable reinforcement learning agents are often incentivized to seek resources and power in pursuit of their objectives. An agent that seeks power in order to optimize a misspecified objective might behave in undesirable ways, including rationally preventing its own deactivation and correction. Others have voiced skepticism: human power-seeking instincts seem idiosyncratic, and these urges need not be present in reinforcement learning agents…
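The paper's central claim can be illustrated numerically in a toy deterministic MDP: when reward functions are sampled at random, optimal policies disproportionately steer toward states that keep more options open. The following is a minimal sketch, not the paper's formal setup — the MDP layout, discount factor, and uniform reward distribution are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.99

# Deterministic toy MDP: from the start state the agent either enters a
# dead end (one reachable future) or a hub (three reachable futures).
succ = {
    0: [1, 2],        # start: dead end (1) or hub (2)
    1: [1],           # dead end: absorbing
    2: [3, 4, 5],     # hub: three absorbing choices
    3: [3], 4: [4], 5: [5],
}

def first_move_of_optimal_policy(r):
    """Run value iteration on state rewards r, then act greedily at the start."""
    V = np.zeros(len(succ))
    for _ in range(400):
        V = np.array([r[s] + gamma * max(V[n] for n in succ[s])
                      for s in range(len(succ))])
    return max(succ[0], key=lambda n: V[n])

trials = 500
hub = sum(first_move_of_optimal_policy(rng.uniform(size=6)) == 2
          for _ in range(trials))
print(hub / trials)  # typically around three quarters under this construction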


Defining and Characterizing Reward Hacking

We provide the first formal definition of reward hacking, a phenomenon where optimizing an imperfect proxy reward function, R̃, leads to poor performance according to the true reward function, R.
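The definition can be made concrete with a deliberately hypothetical toy example — the trajectory encoding and both reward functions below are invented for illustration: a policy that maximizes the proxy can score arbitrarily badly under the true reward.

```python
# Hypothetical illustration of reward hacking: the proxy rewards a
# measurable side signal (time spent near the goal) rather than the
# true objective (actually reaching the goal).
def true_reward(trajectory):
    return 1.0 if trajectory[-1] == "goal" else 0.0

def proxy_reward(trajectory):  # proxy: count steps spent adjacent to the goal
    return sum(1 for s in trajectory if s == "near_goal")

loiter = ["near_goal"] * 10                 # hacks the proxy, never finishes
direct = ["start", "near_goal", "goal"]     # actually solves the task

assert proxy_reward(loiter) > proxy_reward(direct)  # proxy prefers loitering
assert true_reward(direct) > true_reward(loiter)    # true reward disagrees
```

Optimizing the proxy here drives the true reward to zero, which is exactly the divergence between R̃ and R that the definition captures.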

Learning Altruistic Behaviours in Reinforcement Learning without External Rewards

This work proposes an altruistic agent that learns to increase the choices another agent has by preferring to maximize the number of states that the other agent can reach in its future.

Parametrically Retargetable Decision-Makers Tend To Seek Power

This work discovers that many decision-making functions are retargetable, that retargetability is sufficient to cause power-seeking tendencies, and that a range of qualitatively dissimilar decision-making procedures incentivize agents to seek power.

Formalizing the Problem of Side Effect Regularization

This work proposes a formal criterion for side effect regularization via the assistance game framework and shows that this POMDP is solved by trading off the proxy reward with the agent’s ability to achieve a range of future tasks.

Goal Misgeneralization: Why Correct Specifications Aren't Enough For Correct Goals

The field of AI alignment is concerned with AI systems that pursue unintended goals. One commonly studied mechanism by which an unintended goal might arise is specification gaming, in which the…

The alignment problem from a deep learning perspective

Within the coming decades, artificial general intelligence (AGI) may surpass human capabilities at a wide range of important tasks. This report makes a case for why, without substantial action to…

X-Risk Analysis for AI Research

A collection of time-tested concepts from hazard analysis and systems safety, which have been designed to steer large processes in safer directions are reviewed, to discuss how AI researchers can realistically have long-term impacts on the safety of AI systems.

Is Power-Seeking AI an Existential Risk?

This report examines what I see as the core argument for concern about existential risk from misaligned artificial intelligence. I proceed in two stages. First, I lay out a backdrop picture that…

GPT-NeoX-20B: An Open-Source Autoregressive Language Model

We introduce GPT-NeoX-20B, a 20 billion parameter autoregressive language model trained on the Pile, whose weights will be made freely and openly available to the public through a permissive license.

Goal Misgeneralization in Deep Reinforcement Learning

We study goal misgeneralization, a type of out-of-distribution generalization failure in reinforcement learning (RL). Goal misgeneralization occurs when an RL agent retains its capabilities…



Markov Decision Processes: Discrete Stochastic Dynamic Programming

  • M. Puterman
  • Computer Science
    Wiley Series in Probability and Statistics
  • 1994
Markov Decision Processes covers recent research advances in such areas as countable state space models with average reward criterion, constrained models, and models with risk sensitive optimality criteria, and explores several topics that have received little or no attention in other books.

Discrete Dynamic Programming

Reinforcement Learning: An Introduction

This book provides a clear and simple account of the key ideas and algorithms of reinforcement learning, which ranges from the history of the field's intellectual foundations to the most recent developments and applications.

Empowerment - an Introduction

When you made this decision, you were likely relying on a behavioural “proxy”: an internal motivation that abstracts away the problem of evaluating a decision's impact on your overall life, evaluating it instead against some simple fitness function.

Human Compatible: Artificial Intelligence and the Problem of Control

"The most important book I have read in quite some time" (Daniel Kahneman); "A must-read" (Max Tegmark); "The book we've all been waiting for" (Sam Harris) LONGLISTED FOR THE 2019 FINANCIAL TIMES AND

Why AI is harder than we think

This talk discusses some fallacies in common assumptions made by AI researchers, which can lead to overconfident predictions about the field, and speculates on what is needed for the grand challenge of making AI systems more robust, general, and adaptable: in short, more intelligent.

AvE: Assistance via Empowerment

This work proposes a new paradigm for assistance that increases the human's ability to control their environment, formalizes this approach by augmenting reinforcement learning with human empowerment, and proposes an efficient empowerment-inspired proxy metric.

Avoiding Side Effects in Complex Environments

Attainable Utility Preservation (AUP) avoids side effects by penalizing shifts in the agent's ability to achieve randomly generated goals; this work scales AUP beyond toy environments by preserving optimal value for a single randomly generated reward function.

Zoom In: An Introduction to Circuits