Corpus ID: 215416298

Evaluating Machines by their Real-World Language Use

@article{Zellers2020EvaluatingMB,
  title={Evaluating Machines by their Real-World Language Use},
  author={Rowan Zellers and Ari Holtzman and Elizabeth Clark and Lianhui Qin and Ali Farhadi and Yejin Choi},
  journal={ArXiv},
  year={2020},
  volume={abs/2004.03607}
}
There is a fundamental gap between how humans understand and use language -- in open-ended, real-world situations -- and today's NLP benchmarks for language understanding. To narrow this gap, we propose to evaluate machines by their success at real-world language use -- which greatly expands the scope of language tasks that can be measured and studied. We introduce TuringAdvice, a new challenge for language understanding systems. Given a complex situation faced by a real person, a machine must… 
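The truncated abstract stops before describing how success at giving advice is scored. As a minimal sketch of how such a human evaluation can be aggregated (the record layout and field names below are hypothetical illustrations, not the paper's data format), a single helpfulness rate can be computed as the fraction of situations in which raters judge the machine-written advice at least as helpful as the human-written reference:

```python
# Hypothetical sketch of aggregating a TuringAdvice-style human evaluation.
# Each record is one rater's judgment of whether machine-written advice was
# at least as helpful as the reference human-written advice for a situation.
# Field names are illustrative, not the paper's actual format.
from collections import defaultdict
from statistics import mean

def helpfulness_rate(rater_judgments):
    """Fraction of situations where machine advice was judged at least as
    helpful as the human reference (majority vote across raters)."""
    votes = defaultdict(list)
    for j in rater_judgments:
        votes[j["situation_id"]].append(j["machine_at_least_as_helpful"])
    per_situation = [mean(v) >= 0.5 for v in votes.values()]
    return sum(per_situation) / len(per_situation)

# Example: two situations, three raters each.
judgments = [
    {"situation_id": "s1", "machine_at_least_as_helpful": True},
    {"situation_id": "s1", "machine_at_least_as_helpful": False},
    {"situation_id": "s1", "machine_at_least_as_helpful": True},
    {"situation_id": "s2", "machine_at_least_as_helpful": False},
    {"situation_id": "s2", "machine_at_least_as_helpful": False},
    {"situation_id": "s2", "machine_at_least_as_helpful": True},
]
print(helpfulness_rate(judgments))  # 0.5 -> machine wins on 1 of 2 situations
```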

Citations

Measuring Massive Multitask Language Understanding

TLDR
While most recent models have near random-chance accuracy, the very largest GPT-3 model improves over random chance by almost 20 percentage points on average; however, on every one of the 57 tasks, the best models still need substantial improvements before they can reach expert-level accuracy.

Help! Need Advice on Identifying Advice

TLDR
Preliminary models are presented showing that while pre-trained language models are able to capture advice better than rule-based systems, advice identification is challenging, and directions for future research are identified.

Experience Grounds Language

TLDR
It is posited that the present success of representation learning approaches trained on large text corpora can be deeply enriched from the parallel tradition of research on the contextual and social nature of language.

GENIE: A Leaderboard for Human-in-the-Loop Evaluation of Text Generation

TLDR
This work introduces GENIE, an extensible human evaluation leaderboard that brings the ease of leaderboards to text generation tasks, provides formal granular evaluation metrics, and identifies areas for future research.

TellMeWhy: A Dataset for Answering Why-Questions in Narratives

TLDR
This work introduces TellMeWhy, a new crowd-sourced dataset that consists of more than 30k questions and free-form answers concerning why characters in short narratives perform the actions described, and shows that state-of-the-art models are far below human performance on answering such questions.

AI and the Everything in the Whole Wide World Benchmark

TLDR
This work discusses why these benchmarks consistently fall short of capturing meaningful abstractions of their declared motivations, present distorted data lenses of a specific worldview to be optimized for, and disguise key limitations in ways that misrepresent the nature of actual “state of the art” (SOTA) performance of AI systems.

Evaluation of Text Generation: A Survey

TLDR
This paper surveys evaluation methods of natural language generation (NLG) systems that have been developed in the last few years, with a focus on the evaluation of recently proposed NLG tasks and neural NLG models.

Towards Human-Centred Explainability Benchmarks For Text Classification

TLDR
This position paper proposes to extend text classification benchmarks to evaluate the explainability of text classifiers, and to ground these benchmarks in human-centred applications, for example by using social media or gamification, or by learning explainability metrics from human judgements.

Choose Your Own Adventure: Paired Suggestions in Collaborative Writing for Evaluating Story Generation Models

TLDR
This work presents Choose Your Own Adventure, a collaborative writing setup for pairwise model evaluation in which two models generate suggestions for people as they write a short story; writers are asked to choose one of the two suggestions, revealing which model’s suggestions they prefer.

Feature-based detection of automated language models: tackling GPT-2, GPT-3 and Grover

TLDR
This work proposes a simple feature-based classifier for the detection problem, using carefully crafted features that attempt to model intrinsic differences between human and machine text, offering an accessible “first line of defense” against the abuse of language models.
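As a rough illustration of what a feature-based detector of this kind can look like (the features below are generic placeholders and logistic regression is an arbitrary choice; neither is taken from the paper):

```python
# Illustrative feature-based human-vs-machine text classifier.
# The features here are generic placeholders, not the paper's crafted features.
import re
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def features(text: str) -> list[float]:
    words = re.findall(r"\w+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return [
        len(words) / max(len(sentences), 1),                 # avg sentence length
        len(set(words)) / max(len(words), 1),                # type-token ratio
        sum(c in ",;:" for c in text) / max(len(text), 1),   # punctuation rate
    ]

def detector_accuracy(texts, labels):
    """5-fold cross-validated accuracy; labels: 1 = machine, 0 = human."""
    X = np.array([features(t) for t in texts])
    y = np.array(labels)
    clf = LogisticRegression(max_iter=1000)
    return cross_val_score(clf, X, y, cv=5).mean()
```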

References

SHOWING 1-10 OF 66 REFERENCES

Extending Machine Language Models toward Human-Level Language Understanding

TLDR
This work describes existing machine models linking language to concrete situations and points toward extensions that address more abstract cases.

GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding

TLDR
GLUE comprises a benchmark of nine diverse NLU tasks, an auxiliary dataset for probing models' understanding of specific linguistic phenomena, and an online platform for evaluating and comparing models; it favors models that can represent linguistic knowledge in a way that facilitates sample-efficient learning and effective knowledge transfer across tasks.

Learning and Evaluating General Linguistic Intelligence

TLDR
This work analyzes state-of-the-art natural language understanding models and conducts an extensive empirical investigation to evaluate them against general linguistic intelligence criteria, and proposes a new evaluation metric based on an online encoding of the test data that quantifies how quickly an existing agent (model) learns a new task.
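One common way to formalize such an online encoding is prequential (online) coding: each example is scored under the model before the model updates on it, so an agent that adapts to the new task quickly accumulates a shorter code. The sketch below assumes a generic model exposing nll and update methods and may differ in detail from the paper's exact metric:

```python
# Prequential ("online") codelength sketch: total bits needed to encode the
# task data when each example is predicted before the model trains on it.
# `model.nll(x, y)` (negative log-likelihood in nats) and `model.update(x, y)`
# are assumed interfaces, not any particular library's API.
import math

def online_codelength(model, examples):
    total_nats = 0.0
    for x, y in examples:
        total_nats += model.nll(x, y)   # score under the current model
        model.update(x, y)              # then learn from the example
    return total_nats / math.log(2)     # convert nats to bits

# A smaller codelength indicates that the model learns the new task faster.
```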

Unifying Human and Statistical Evaluation for Natural Language Generation

TLDR
This paper proposes HUSE, a unified framework that evaluates both diversity and quality based on the optimal error rate of predicting whether a sentence is human- or machine-generated; HUSE is efficiently estimated by combining human and statistical evaluation.
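As a rough sketch of how such an optimal-error estimate can be approximated in practice (leave-one-out nearest-neighbour classification over per-sentence features; the feature choice and the doubling convention below follow my reading of the TLDR and are assumptions, not a faithful reimplementation of HUSE):

```python
# Sketch: estimate the optimal error of telling human from machine sentences
# from a small feature vector per sentence, via leave-one-out k-NN.
# The features and the k-NN approximation are illustrative assumptions.
import numpy as np

def loo_knn_error(X, y, k=3):
    """Leave-one-out k-nearest-neighbour classification error."""
    X, y = np.asarray(X, float), np.asarray(y)
    errors = 0
    for i in range(len(X)):
        dists = np.linalg.norm(X - X[i], axis=1)
        dists[i] = np.inf                      # exclude the held-out point
        neighbours = y[np.argsort(dists)[:k]]
        pred = int(neighbours.mean() >= 0.5)
        errors += int(pred != y[i])
    return errors / len(X)

def huse_like_score(features_human, features_machine, k=3):
    """2 * estimated optimal error: near 1 means human and machine text are
    indistinguishable under these features, near 0 means easily separated."""
    X = np.vstack([features_human, features_machine])
    y = np.array([0] * len(features_human) + [1] * len(features_machine))
    return 2 * loo_knn_error(X, y, k)
```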

Improving Language Understanding by Generative Pre-Training

TLDR
The general task-agnostic model outperforms discriminatively trained models that use architectures specifically crafted for each task, improving upon the state of the art in 9 out of the 12 tasks studied.

On Evaluating and Comparing Open Domain Dialog Systems

TLDR
This paper proposes a comprehensive evaluation strategy with multiple metrics designed to reduce subjectivity by selecting metrics that correlate well with human judgement, which the authors view as a step towards an automatic evaluation process for conversational AIs.

A large annotated corpus for learning natural language inference

TLDR
The Stanford Natural Language Inference corpus is introduced: a new, freely available collection of labeled sentence pairs written by humans doing a novel grounded task based on image captioning, which for the first time allows a neural network-based model to perform competitively on natural language inference benchmarks.

Natural Language Does Not Emerge ‘Naturally’ in Multi-Agent Dialog

TLDR
This paper presents a sequence of ‘negative’ results culminating in a ‘positive’ one – showing that while most agent-invented languages are effective, they are decidedly not interpretable or compositional.

SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems

TLDR
A new benchmark styled after GLUE is presented, comprising a set of more difficult language understanding tasks, a software toolkit, and a public leaderboard.

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

TLDR
This systematic study compares pre-training objectives, architectures, unlabeled datasets, transfer approaches, and other factors on dozens of language understanding tasks and achieves state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more.
...