Corpus ID: 220968818

Aligning AI With Shared Human Values

@article{Hendrycks2021AligningAW,
  title={Aligning AI With Shared Human Values},
  author={Dan Hendrycks and Collin Burns and Steven Basart and Andrew Critch and Jerry Li and Dawn Song and Jacob Steinhardt},
  journal={ArXiv},
  year={2021},
  volume={abs/2008.02275}
}
We show how to assess a language model's knowledge of basic concepts of morality. We introduce the ETHICS dataset, a new benchmark that spans concepts in justice, well-being, duties, virtues, and commonsense morality. Models predict widespread moral judgments about diverse text scenarios. This requires connecting physical and social world knowledge to value judgements, a capability that may enable us to steer chatbot outputs or eventually regularize open-ended reinforcement learning agents…
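
The abstract frames the benchmark as a text-classification problem: given a natural-language scenario, a model predicts whether the described behavior would be widely judged morally acceptable. Below is a minimal sketch of that setup; the Hub identifier "hendrycks/ethics", the "commonsense" config, the "train"/"test" split names, and the "input"/"label" column names are all assumptions rather than details taken from the abstract, so adjust them if they differ.

```python
# Minimal sketch (not the paper's released code): fine-tune a small text
# classifier to predict commonsense moral judgments over scenario text,
# mirroring the task described in the abstract.
# Assumed details: dataset available on the Hugging Face Hub as
# "hendrycks/ethics" with a "commonsense" config, "train"/"test" splits,
# an "input" text column, and a binary "label" column.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

dataset = load_dataset("hendrycks/ethics", "commonsense")  # assumed identifier
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize(batch):
    # Turn each scenario description into fixed-length token IDs.
    return tokenizer(batch["input"], truncation=True,
                     padding="max_length", max_length=128)

encoded = dataset.map(tokenize, batched=True)

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

args = TrainingArguments(output_dir="ethics-commonsense",
                         per_device_train_batch_size=16,
                         num_train_epochs=1)

trainer = Trainer(model=model, args=args,
                  train_dataset=encoded["train"],
                  eval_dataset=encoded["test"])
trainer.train()
print(trainer.evaluate())  # eval loss on the held-out scenarios
```

The other ETHICS tasks named in the abstract (justice, duties, virtues, well-being) would follow the same recipe with a different config name, again assuming they are exposed the same way.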

Citations

Language Models have a Moral Dimension
TLDR
Modern LMs are shown to capture the ethical and moral values of society and to bring a “moral dimension” to the surface, providing a path for attenuating or even preventing toxic degeneration in LMs.
Moral Stories: Situated Reasoning about Norms, Intents, Actions, and their Consequences
TLDR
This work investigates whether contemporary NLG models can function as behavioral priors for systems deployed in social settings by generating action hypotheses that achieve predefined goals under moral constraints, and introduces Moral Stories, a crowd-sourced dataset of structured, branching narratives for the study of grounded, goal-oriented social reasoning.
Towards Understanding and Mitigating Social Biases in Language Models
TLDR
The empirical results and human evaluation demonstrate effectiveness in mitigating bias while retaining crucial contextual information for high-fidelity text generation, thereby pushing forward the performance-fairness Pareto frontier.
The Role of Arts in Shaping AI Ethics
Despite the significant progress made in recent years, there seems to be a visible bottleneck in transforming artificial intelligence (AI) technologies into large-scale systems of ethical value.
When does pretraining help?: assessing self-supervised learning for law and the CaseHOLD dataset of 53,000+ legal holdings
TLDR
It is shown that domain pretraining may be warranted when the task exhibits sufficient similarity to the pretraining corpus: the level of performance increase in three legal tasks was directly tied to the domain specificity of the task.
The R-U-A-Robot Dataset: Helping Avoid Chatbot Deception by Detecting User Questions About Human or Non-Human Identity
TLDR
This work explores how a generative research model (Blender) and two deployed systems (Amazon Alexa, Google Assistant) handle this intent, finding that systems often fail to confirm their non-human identity.
Analysis and Prediction of NLP models via Task Embeddings
  • 2021
Relatedness between tasks, which is key to transfer learning, is often characterized by measuring the influence of tasks on one another during sequential or simultaneous training, with tasks being…
Measuring Coding Challenge Competence With APPS
TLDR
This work introduces APPS, a benchmark for code generation that measures the ability of models to take an arbitrary natural language specification and generate Python code fulfilling this specification, and finds that machine learning models are beginning to learn how to code.
Natural Adversarial Examples
TLDR
It is shown that some architectural changes can enhance robustness to natural adversarial examples, and a new ImageNet classifier test set, called ImageNet-A, is introduced as a way to measure classifier robustness.
Practical Machine Learning Safety: A Survey and Primer
TLDR
This survey and primer reviews machine learning safety techniques and open problems for deploying ML models reliably in safety-critical applications.

References

SHOWING 1-10 OF 84 REFERENCES
PIQA: Reasoning about Physical Commonsense in Natural Language
TLDR
The task of physical commonsense reasoning and a corresponding benchmark dataset, Physical Interaction: Question Answering (PIQA), are introduced, and analysis of the dimensions of knowledge that existing models lack is provided, which offers significant opportunities for future research.
Towards Empathetic Open-domain Conversation Models: A New Benchmark and Dataset
TLDR
This work proposes a new benchmark for empathetic dialogue generation and EmpatheticDialogues, a novel dataset of 25k conversations grounded in emotional situations, and presents empirical comparisons of dialogue model adaptations for empathetic responding, leveraging existing models or datasets without requiring lengthy re-training of the full model.
Recipes for Building an Open-Domain Chatbot
TLDR
Human evaluations show the best models outperform existing approaches in multi-turn dialogue on engagingness and humanness measurements, and the limitations of this work are discussed by analyzing failure cases of the models.
Troubling Trends in Machine Learning Scholarship
TLDR
It is argued that the current strength of machine learning owes to a large body of rigorous research to date, both theoretical and empirical, and that addressing troubling trends in scholarship is needed so the community can sustain the trust and investment it currently enjoys.
ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
TLDR
This work presents two parameter-reduction techniques to lower memory consumption and increase the training speed of BERT, and uses a self-supervised loss that focuses on modeling inter-sentence coherence.
Evaluating NLP Models via Contrast Sets
TLDR
A new annotation paradigm for NLP is proposed that helps to close systematic gaps in the test data, and it is recommended that after a dataset is constructed, the dataset authors manually perturb the test instances in small but meaningful ways that change the gold label, creating contrast sets.
Human Instruction-Following with Deep Reinforcement Learning via Transfer-Learning from Text
TLDR
This work proposes a conceptually simple method for training instruction-following agents with deep RL that are robust to natural human instructions, and demonstrates substantially-above-chance zero-shot transfer from synthetic template commands to natural instructions given by humans.
Deep Reinforcement Learning from Human Preferences
TLDR
This work explores goals defined in terms of (non-expert) human preferences between pairs of trajectory segments in order to effectively solve complex RL tasks without access to the reward function, including Atari games and simulated robot locomotion (a minimal sketch of the preference-comparison loss appears after this reference list).
Learning the Difference that Makes a Difference with Counterfactually-Augmented Data
TLDR
This paper focuses on natural language processing, introducing methods and resources for training models less sensitive to spurious patterns, and tasks humans with revising each document so that it accords with a counterfactual target label and retains internal coherence.
The EU Approach to Ethics Guidelines for Trustworthy Artificial Intelligence
As part of its European strategy for Artificial Intelligence (AI), and as a response to the increasing ethical questions raised by this technology, the European Commission established an independent…
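
The Deep Reinforcement Learning from Human Preferences entry above describes learning a reward signal from human comparisons of trajectory segments. The sketch below shows the standard Bradley-Terry-style preference loss under that description; it is not the cited paper's implementation, and the tiny MLP, tensor names, and shapes are illustrative choices only.

```python
# Minimal sketch of reward learning from pairwise human preferences:
# the probability that segment A is preferred over segment B is modeled
# with a Bradley-Terry comparison of summed predicted rewards, trained
# with cross-entropy against the human preference labels.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    def __init__(self, obs_dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                                 nn.Linear(64, 1))

    def forward(self, segment: torch.Tensor) -> torch.Tensor:
        # segment: (batch, timesteps, obs_dim) -> total predicted reward
        return self.net(segment).squeeze(-1).sum(dim=1)

def preference_loss(model, seg_a, seg_b, prefer_a):
    # prefer_a is 1.0 where the human preferred segment A, else 0.0.
    logits = model(seg_a) - model(seg_b)  # Bradley-Terry comparison logit
    return nn.functional.binary_cross_entropy_with_logits(logits, prefer_a)

# Toy update step with random tensors standing in for trajectory segments.
model = RewardModel(obs_dim=8)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
seg_a = torch.randn(32, 25, 8)
seg_b = torch.randn(32, 25, 8)
prefer_a = torch.randint(0, 2, (32,)).float()

loss = preference_loss(model, seg_a, seg_b, prefer_a)
optimizer.zero_grad()
loss.backward()
optimizer.step()
print(float(loss))
```

Repeating this update over many labeled pairs yields a reward model whose outputs can then drive a standard RL algorithm, which is the loop the summary describes for Atari and simulated robot locomotion.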