Corpus ID: 215416298

Evaluating Machines by their Real-World Language Use

Rowan Zellers, Ari Holtzman, Elizabeth Clark, Lianhui Qin, Ali Farhadi, Yejin Choi
There is a fundamental gap between how humans understand and use language -- in open-ended, real-world situations -- and today's NLP benchmarks for language understanding. To narrow this gap, we propose to evaluate machines by their success at real-world language use -- which greatly expands the scope of language tasks that can be measured and studied. We introduce TuringAdvice, a new challenge for language understanding systems. Given a complex situation faced by a real person, a machine must…
Help! Need Advice on Identifying Advice
Preliminary models are presented showing that while pre-trained language models are able to capture advice better than rule-based systems, advice identification is challenging, and directions for future research are identified.
Feature-based detection of automated language models: tackling GPT-2, GPT-3 and Grover
This work proposes a simple feature-based classifier for the detection problem, using carefully crafted features that attempt to model intrinsic differences between human and machine text, offering an accessible “first line of defense” against the abuse of language models.
Evaluation of Text Generation: A Survey
This paper surveys evaluation methods of natural language generation (NLG) systems that have been developed in the last few years, with a focus on the evaluation of recently proposed NLG tasks and neural NLG models.
Choose Your Own Adventure: Paired Suggestions in Collaborative Writing for Evaluating Story Generation Models
This work presents Choose Your Own Adventure, a collaborative writing setup for pairwise model evaluation, in which two models generate suggestions to people as they write a short story; writers are asked to choose one of the two suggestions, revealing which model’s suggestions they prefer.
Forecasting AI Progress: A Research Agenda
This work describes the development of a research agenda for forecasting AI progress, which utilized the Delphi technique to elicit and aggregate experts' opinions on which questions and methods to prioritize.
GENIE: A Leaderboard for Human-in-the-Loop Evaluation of Text Generation
This work introduces GENIE, an extensible human evaluation leaderboard, which brings the ease of leaderboards to text generation tasks, provides formal granular evaluation metrics, and identifies areas for future research.
Measuring Massive Multitask Language Understanding
While most recent models have near random-chance accuracy, the very largest GPT-3 model improves over random chance by almost 20 percentage points on average; however, on every one of the 57 tasks, the best models still need substantial improvements before they can reach expert-level accuracy.
TellMeWhy: A Dataset for Answering Why-Questions in Narratives
This work introduces TellMeWhy, a new crowd-sourced dataset that consists of more than 30k questions and free-form answers concerning why characters in short narratives perform the actions described, and shows that state-of-the-art models are far below human performance on answering such questions.
What Will it Take to Fix Benchmarking in Natural Language Understanding?
It is argued that most current benchmarks fail to meet these criteria, and that adversarially constructed, out-of-distribution test sets do not meaningfully address the causes of these failures.
Experience Grounds Language
It is posited that the present success of representation learning approaches trained on large text corpora can be deeply enriched from the parallel tradition of research on the contextual and social nature of language.


GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding
A benchmark of nine diverse NLU tasks, an auxiliary dataset for probing models for understanding of specific linguistic phenomena, and an online platform for evaluating and comparing models, which favors models that can represent linguistic knowledge in a way that facilitates sample-efficient learning and effective knowledge-transfer across tasks.
Learning and Evaluating General Linguistic Intelligence
This work analyzes state-of-the-art natural language understanding models and conducts an extensive empirical investigation to evaluate them against general linguistic intelligence criteria, and proposes a new evaluation metric based on an online encoding of the test data that quantifies how quickly an existing agent (model) learns a new task.
Unifying Human and Statistical Evaluation for Natural Language Generation
This paper proposes a unified framework which evaluates both diversity and quality, based on the optimal error rate of predicting whether a sentence is human- or machine-generated, called HUSE, which is efficiently estimated by combining human and statistical evaluation.
On Evaluating and Comparing Open Domain Dialog Systems
This paper proposes a comprehensive evaluation strategy with multiple metrics designed to reduce subjectivity by selecting metrics which correlate well with human judgement, and believes that this work is a step towards an automatic evaluation process for conversational AIs.
Improving Language Understanding by Generative Pre-Training
The general task-agnostic model outperforms discriminatively trained models that use architectures specifically crafted for each task, significantly improving upon the state of the art in 9 out of the 12 tasks studied.
A large annotated corpus for learning natural language inference
The Stanford Natural Language Inference corpus is introduced, a new, freely available collection of labeled sentence pairs, written by humans doing a novel grounded task based on image captioning, which allows a neural network-based model to perform competitively on natural language inference benchmarks for the first time.
Natural Language Does Not Emerge ‘Naturally’ in Multi-Agent Dialog
This paper presents a sequence of ‘negative’ results culminating in a ‘positive’ one – showing that while most agent-invented languages are effective, they are decidedly not interpretable or compositional.
SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems
A new benchmark styled after GLUE is presented, comprising a set of more difficult language understanding tasks, a software toolkit, and a public leaderboard.
PIQA: Reasoning about Physical Commonsense in Natural Language
The task of physical commonsense reasoning and a corresponding benchmark dataset, Physical Interaction: Question Answering (PIQA), are introduced, and analysis of the dimensions of knowledge that existing models lack is provided, offering significant opportunities for future research.
On Making Reading Comprehension More Comprehensive
This work justifies a question answering approach to reading comprehension and describes the various kinds of questions one might use to more fully test a system’s comprehension of a passage, moving beyond questions that only probe local predicate-argument structures.