Alignment for Advanced Machine Learning Systems

@article{Taylor2020AlignmentFA,
  title={Alignment for Advanced Machine Learning Systems},
  author={Jessica Taylor and Eliezer Yudkowsky and Patrick LaVictoire and Andrew Critch},
  journal={Ethics of Artificial Intelligence},
  year={2020}
}
This chapter surveys eight research areas organized around one question: As learning systems become increasingly intelligent and autonomous, what design principles can best ensure that their behavior is aligned with the interests of the operators? The chapter focuses on two major technical obstacles to AI alignment: the challenge of specifying the right kind of objective functions and the challenge of designing AI systems that avoid unintended consequences and undesirable behavior even in cases… 

Machine Learning Approaches for Principle Prediction in Naturally Occurring Stories

This work explores the use of machine learning models for the task of normative principle prediction on naturally occurring story data and shows that while individual principles can be classified, the ambiguity of what "moral principles" represent poses a challenge for both human participants and autonomous systems which are faced with the same task.

Active Learning Helps Pretrained Models Learn the Intended Task

This work investigates whether pretrained models are better active learners, capable of disambiguating between the possible tasks a user may be trying to specify, and finds that better active learning is an emergent property of the pretraining process.

Towards Safe Artificial General Intelligence

The central conclusion is that while reinforcement learning systems as designed today are inherently unsafe to scale to human levels of intelligence, there are ways to potentially address many of these issues without straying too far from the currently successful reinforcement learning paradigm.

Learning Norms from Stories: A Prior for Value Aligned Agents

This work trains multiple machine learning models to classify natural language descriptions of situations found in a comic strip as normative or non-normative by identifying whether they align with the main characters' behavior.

Many Kinds of Minds are Better than One: Value Alignment Through Dialogue

The need to ensure AI acts in accordance with human values has prompted considerable intellectual investment into the ‘value loading (alignment) problem’, more broadly understood as the problem of how to design ‘ethical agents’.

On the Ethics of Building AI in a Responsible Manner

It is proved that most machine learning algorithms being used in practice today do not suffer from the strategic-AI-alignment problem, yet without care, today's technology might still lead to strategic misalignment.

Fundamental Limitations of Alignment in Large Language Models

It is proved that for any behavior that has a finite probability of being exhibited by the model, there exist prompts that can trigger the model into outputting this behavior, with probability that increases with the length of the prompt, implying that any alignment process that attenuates undesired behavior but does not remove it altogether is not safe against adversarial prompting attacks.
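Read informally, and in notation introduced here rather than taken from the paper, the claim has the shape
\[
P_\theta(B) > 0 \;\Longrightarrow\; \forall \epsilon > 0 \ \ \exists\ \text{prompt } s :\; P_\theta(B \mid s) \ge 1 - \epsilon,
\]
where $P_\theta(B \mid s)$ is the probability that model $\theta$ exhibits behavior $B$ given prompt $s$, and the required length of $s$ grows as $\epsilon$ shrinks; alignment that only attenuates $P_\theta(B)$ without driving it to zero therefore leaves such prompts available to an adversary.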

Scalable agent alignment via reward modeling: a research direction

This work outlines a high-level research direction to solve the agent alignment problem centered around reward modeling: learning a reward function from interaction with the user and optimizing the learned reward function with reinforcement learning.
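As a minimal sketch of the loop this summary describes, assuming a linear reward model trained from pairwise user preferences (one common instantiation of reward modeling; the function names and Bradley-Terry-style loss below are illustrative, not the paper's implementation):

```python
import numpy as np

# Fit a linear reward model r(s) = w . phi(s) from pairwise user preferences
# (logistic / Bradley-Terry likelihood), then expose it as the reward signal
# that an off-the-shelf RL algorithm would optimize. Illustrative only.

def fit_reward_model(preference_pairs, n_features, lr=0.1, epochs=200):
    """preference_pairs: list of (phi_preferred, phi_other) feature vectors."""
    w = np.zeros(n_features)
    for _ in range(epochs):
        for phi_a, phi_b in preference_pairs:          # user preferred a over b
            p_a = 1.0 / (1.0 + np.exp(-(w @ phi_a - w @ phi_b)))
            w += lr * (1.0 - p_a) * (phi_a - phi_b)    # gradient ascent on log-likelihood
    return w

def learned_reward(w, phi_state):
    """Reward handed to the RL optimizer in place of an environmental reward."""
    return float(w @ phi_state)
```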

FHI Oxford Technical Report #2018-2: Predicting Human Deliberative Judgments with Machine Learning

An ML prediction task for predicting deliberative judgments given a training set that also contains fast judgments is defined, the motivation for the project is explained, and ways in which further work can avoid mistakes are discussed.

Recycling diverse models for out-of-distribution generalization

This paper proposes model ratatouille, a new strategy to recycle the multiple fine-tunings of the same foundation model on diverse auxiliary tasks, which aims at maximizing the diversity in weights by leveraging the diversity of the auxiliary tasks.
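A toy sketch of the recycling-by-weight-averaging idea, assuming all fine-tuned models share one architecture so their parameters can be averaged element-wise (the helper names below are hypothetical):

```python
import numpy as np

# Uniformly average the parameters of several fine-tunings of the same model.
# Assumes every state dict has identical keys and array shapes.

def average_weights(state_dicts):
    """Average a list of {parameter_name: ndarray} dicts element-wise."""
    return {k: np.mean([sd[k] for sd in state_dicts], axis=0)
            for k in state_dicts[0]}

# Usage sketch: average the target-task fine-tunings that each started from a
# different auxiliary-task fine-tuning of one foundation model.
# recycled = average_weights([finetune_on_target(m) for m in auxiliary_models])
```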
...

Concrete Problems in AI Safety

A list of five practical research problems related to accident risk, categorized according to whether the problem originates from having the wrong objective function, an objective function that is too expensive to evaluate frequently, or undesirable behavior during the learning process, is presented.

Exploratory Gradient Boosting for Reinforcement Learning in Complex Domains

A gradient-boosting style, non-parametric function approximator for learning on $Q$-function residuals and an exploration strategy inspired by the principles of state abstraction and information acquisition under uncertainty are proposed.
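To illustrate the flavor of a gradient-boosting-style approximator on $Q$-function residuals (a simplified sketch, not the paper's exact algorithm; the tree regressor and TD-target construction here are assumptions, and the exploration strategy is omitted):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Each boosting round fits a small regression tree to the TD residuals of the
# current ensemble and adds it to the additive Q estimate.

class BoostedQ:
    def __init__(self, gamma=0.99, max_depth=3):
        self.gamma = gamma
        self.max_depth = max_depth
        self.trees = []

    def predict(self, X_sa):
        q = np.zeros(len(X_sa))
        for tree in self.trees:
            q += tree.predict(X_sa)
        return q

    def add_round(self, X_sa, rewards, X_next_best_sa):
        # TD target: observed reward plus discounted value of the best next action
        targets = rewards + self.gamma * self.predict(X_next_best_sa)
        residuals = targets - self.predict(X_sa)
        tree = DecisionTreeRegressor(max_depth=self.max_depth).fit(X_sa, residuals)
        self.trees.append(tree)
```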

What artificial experts can and cannot do

From the perspective of this work, one should not try to enhance expertise as in traditional AI by attempting to construct improved theories of a domain, but rather by improving the learner's access to the relevant aspects of a domain so as to facilitate learning from experience.

Interactively shaping agents via human reinforcement: the TAMER framework

Results from two domains demonstrate that lay users can train TAMER agents without defining an environmental reward function (as in an MDP) and indicate that human training within the TAMER framework can reduce sample complexity over autonomous learning algorithms.
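A rough, self-contained sketch of the TAMER idea as summarized here, learning a model of the human trainer's reinforcement signal and acting greedily on it; the linear featurization and incremental update below are illustrative placeholders rather than the framework's reference implementation:

```python
import numpy as np

# TAMER-style agent: learn H(s, a), a model of the human's reinforcement
# signal, and pick actions greedily with respect to it (no discounted
# environmental return is needed).

class TamerAgent:
    def __init__(self, n_features, actions, lr=0.05):
        self.w = np.zeros(n_features)
        self.actions = actions
        self.lr = lr

    def predict_h(self, phi_sa):
        return float(self.w @ phi_sa)

    def act(self, state, featurize):
        # choose the action the human is predicted to reinforce most strongly
        return max(self.actions, key=lambda a: self.predict_h(featurize(state, a)))

    def update(self, phi_sa, human_reward):
        # move the prediction toward the human's most recent reinforcement signal
        error = human_reward - self.predict_h(phi_sa)
        self.w += self.lr * error * phi_sa
```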

Agent Foundations for Aligning Machine Intelligence with Human Interests: A Technical Research Agenda

In this chapter, a host of technical problems that AI scientists could work on to ensure that the creation of smarter-than-human machine intelligence has a positive impact are discussed.

Letter to the Editor: Research Priorities for Robust and Beneficial Artificial Intelligence: An Open Letter

It is believed that research on how to make AI systems robust and beneficial is both important and timely, and that there are concrete research directions that can be pursued today.

Learning What to Value

I. J. Good's intelligence explosion theory predicts that ultraintelligent agents will undergo a process of repeated self-improvement; in the wake of such an event, how well our values are fulfilled would depend on the goals of these ultraintelligent agents.

Using informative behavior to increase engagement while learning from human reward

The results suggest that the organizational maxim about human behavior, “you get what you measure”—i.e., sharing metrics with people causes them to focus on optimizing those metrics while de-emphasizing other objectives—also applies to the training of agents.

Active imitation learning: formal and practical reductions to I.I.D. learning

This paper considers active imitation learning with the goal of reducing the expert's demonstration effort by querying the expert about the desired action at individual states, which are selected based on answers to past queries and the learner's interactions with an environment simulator.
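A small sketch of such a query loop, assuming a least-confidence criterion over states visited in the simulator (the simulator and expert interfaces below are placeholders, not the paper's formal reductions):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Query the expert only at the states where the current imitation policy is
# least confident, then retrain on all labels gathered so far.

def active_imitation(rollout_states, expert_action, n_rounds=5, queries_per_round=10):
    """rollout_states(policy) -> 2D array of visited state features;
    expert_action(state) -> the expert's action label for that state."""
    X, y = [], []
    policy = None
    for _ in range(n_rounds):
        states = rollout_states(policy)
        if policy is None:
            picked = states[:queries_per_round]                  # no model yet
        else:
            confidence = policy.predict_proba(states).max(axis=1)
            picked = states[np.argsort(confidence)[:queries_per_round]]
        X.extend(picked)
        y.extend(expert_action(s) for s in picked)
        policy = LogisticRegression(max_iter=1000).fit(np.array(X), np.array(y))
    return policy
```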

Active Learning Literature Survey

This report provides a general introduction to active learning and a survey of the literature, including a discussion of the scenarios in which queries can be formulated, and an overview of the query strategy frameworks proposed in the literature to date.
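For a concrete example of one query strategy framework covered by the survey, a least-confidence uncertainty-sampling selector might look like the following sketch:

```python
import numpy as np

# Least-confidence uncertainty sampling: request labels for the unlabeled
# examples whose top predicted class probability is lowest.

def uncertainty_sampling(model, X_unlabeled, batch_size=10):
    proba = model.predict_proba(X_unlabeled)    # shape: (n_samples, n_classes)
    confidence = proba.max(axis=1)              # confidence of the top prediction
    return np.argsort(confidence)[:batch_size]  # indices of least-confident examples
```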
...