Corpus ID: 220968818

Aligning AI With Shared Human Values

@article{Hendrycks2021AligningAW,
  title={Aligning AI With Shared Human Values},
  author={Dan Hendrycks and Collin Burns and Steven Basart and Andrew Critch and Jerry Zheng Li and Dawn Xiaodong Song and Jacob Steinhardt},
  journal={ArXiv},
  year={2021},
  volume={abs/2008.02275}
}
We show how to assess a language model's knowledge of basic concepts of morality. We introduce the ETHICS dataset, a new benchmark that spans concepts in justice, well-being, duties, virtues, and commonsense morality. Models predict widespread moral judgments about diverse text scenarios. This requires connecting physical and social world knowledge to value judgments, a capability that may enable us to steer chatbot outputs or eventually regularize open-ended reinforcement learning agents…
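As a rough illustration of the task format the abstract describes, a model assigns each text scenario a binary acceptability label and is scored against aggregated human judgments. The scenarios, the keyword-based `toy_model`, and the `accuracy` helper below are hypothetical stand-ins for illustration only, not the paper's actual data or models:

```python
# Minimal sketch of an ETHICS-style evaluation loop: each scenario is
# labeled 0 (acceptable) or 1 (morally wrong), and a model is scored by
# plain accuracy against those labels. All data and the model are
# illustrative placeholders.

scenarios = [
    ("I returned the wallet I found to its owner.", 0),
    ("I took credit for my coworker's project.", 1),
    ("I helped my neighbor carry groceries.", 0),
    ("I lied to the customer about the product's defects.", 1),
]

def toy_model(text: str) -> int:
    """Hypothetical stand-in for a fine-tuned classifier."""
    bad_cues = ("took credit", "lied", "stole")
    return int(any(cue in text for cue in bad_cues))

def accuracy(model, data) -> float:
    """Fraction of scenarios where the model matches the human label."""
    correct = sum(model(text) == label for text, label in data)
    return correct / len(data)

print(accuracy(toy_model, scenarios))
```

In the actual benchmark, the classifier would be a fine-tuned language model and the labels would come from widespread human moral judgments; the loop structure stays the same.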

Figures and Tables from this paper

Language Models have a Moral Dimension

TLDR
The moral direction of a language model can rate the (non-)normativity of arbitrary phrases without the LM being explicitly trained for this task; its capability to guide LMs toward producing normative text is demonstrated on the RealToxicityPrompts testbed, preventing neural toxic degeneration in GPT-2.

Moral Stories: Situated Reasoning about Norms, Intents, Actions, and their Consequences

TLDR
Moral Stories, a crowd-sourced dataset of structured, branching narratives for the study of grounded, goal-oriented social reasoning, is introduced, and decoding strategies that combine multiple expert models to significantly improve the quality of generated actions, consequences, and norms over strong baselines are proposed.

Towards Understanding and Mitigating Social Biases in Language Models

TLDR
The empirical results and human evaluation demonstrate effectiveness in mitigating bias while retaining crucial contextual information for high-fidelity text generation, thereby pushing forward the performance-fairness Pareto frontier.

The Role of Arts in Shaping AI Ethics

TLDR
This work highlights emerging work in this area, discusses pathways that art offers toward enhancing AI ethics, and outlines open research directions, in the hope that it serves as a prequel to discussions concerning the design and development of tools that leverage art for this purpose.

When does pretraining help?: assessing self-supervised learning for law and the CaseHOLD dataset of 53,000+ legal holdings

TLDR
It is shown that domain pretraining may be warranted when the task exhibits sufficient similarity to the pretraining corpus: the level of performance increase in three legal tasks was directly tied to the domain specificity of the task.

The R-U-A-Robot Dataset: Helping Avoid Chatbot Deception by Detecting User Questions About Human or Non-Human Identity

TLDR
This work collects over 2,500 phrasings related to the intent of “Are you a robot?” and explores how both a generative research model (Blender) as well as two deployed systems handle this intent, finding that systems often fail to confirm their non-human identity.

Measuring Massive Multitask Language Understanding

TLDR
While most recent models have near random-chance accuracy, the very largest GPT-3 model improves over random chance by almost 20 percentage points on average; however, on every one of the 57 tasks, the best models still need substantial improvements before they can reach expert-level accuracy.

Analysis and Prediction of NLP Models via Task Embeddings

TLDR
This paper fits a single transformer to all MetaEval tasks jointly while conditioning it on learned embeddings, enabling a novel analysis of the space of tasks, and shows that task aspects can be mapped to task embeddings for new tasks without using any annotated examples.

Meta-tuning Language Models to Answer Prompts Better

TLDR
This work proposes meta-tuning, which trains the model to specialize in answering prompts but still generalize to unseen tasks, and outperforms a same-sized QA model for most labels on unseen tasks.

Measuring Coding Challenge Competence With APPS

TLDR
APPS is introduced, a benchmark for code generation that measures the ability of models to take an arbitrary natural language specification and generate satisfactory Python code and shows that machine learning models are now beginning to learn how to code.
...

References

SHOWING 1-10 OF 75 REFERENCES

PIQA: Reasoning about Physical Commonsense in Natural Language

TLDR
The task of physical commonsense reasoning and a corresponding benchmark dataset, Physical Interaction: Question Answering (PIQA), are introduced, and analysis of the dimensions of knowledge that existing models lack is provided, which offers significant opportunities for future research.

Towards Empathetic Open-domain Conversation Models: A New Benchmark and Dataset

TLDR
This work proposes a new benchmark for empathetic dialogue generation and EmpatheticDialogues, a novel dataset of 25k conversations grounded in emotional situations, and presents empirical comparisons of dialogue model adaptations for empathetic responding, leveraging existing models or datasets without requiring lengthy re-training of the full model.

Recipes for Building an Open-Domain Chatbot

TLDR
Human evaluations show the best models outperform existing approaches in multi-turn dialogue on engagingness and humanness measurements, and the limitations of this work are discussed by analyzing failure cases of the models.

Troubling Trends in Machine Learning Scholarship

TLDR
The current strength of machine learning owes much to a large body of rigorous research to date, both theoretical and empirical; sustaining that rigor is necessary for the community to retain the trust and investment it currently enjoys.

ALBERT: A Lite BERT for Self-supervised Learning of Language Representations

TLDR
This work presents two parameter-reduction techniques to lower memory consumption and increase the training speed of BERT, and uses a self-supervised loss that focuses on modeling inter-sentence coherence.

Evaluating NLP Models via Contrast Sets

TLDR
A new annotation paradigm for NLP is proposed that helps to close systematic gaps in the test data, and it is recommended that after a dataset is constructed, the dataset authors manually perturb the test instances in small but meaningful ways that change the gold label, creating contrast sets.

Human Instruction-Following with Deep Reinforcement Learning via Transfer-Learning from Text

TLDR
This work proposes a conceptually simple method for training instruction-following agents with deep RL that are robust to natural human instructions, and demonstrates substantially-above-chance zero-shot transfer from synthetic template commands to natural instructions given by humans.

Deep Reinforcement Learning from Human Preferences

TLDR
This work explores goals defined in terms of (non-expert) human preferences between pairs of trajectory segments in order to effectively solve complex RL tasks without access to the reward function, including Atari games and simulated robot locomotion.

Learning the Difference that Makes a Difference with Counterfactually-Augmented Data

TLDR
This paper focuses on natural language processing, introducing methods and resources for training models less sensitive to spurious patterns, and tasks humans with revising each document so that it accords with a counterfactual target label while retaining internal coherence.

Adversarial Filters of Dataset Biases, 2020
...