Corpus ID: 237263476

CommonsenseQA 2.0: Exposing the Limits of AI through Gamification

@inproceedings{talmor2021commonsenseqa2,
  title={CommonsenseQA 2.0: Exposing the Limits of AI through Gamification},
  author={Alon Talmor and Ori Yoran and Ronan Le Bras and Chandrasekhar Bhagavatula and Yoav Goldberg and Yejin Choi and Jonathan Berant},
  booktitle={NeurIPS Datasets and Benchmarks}
}
Constructing benchmarks that test the abilities of modern natural language understanding models is difficult – pre-trained language models exploit artifacts in benchmarks to achieve human parity, but still fail on adversarial examples and make errors that demonstrate a lack of common sense. In this work, we propose gamification as a framework for data construction. The goal of players in the game is to compose questions that mislead a rival AI, while using specific phrases for extra points. The… 

Evaluate Confidence Instead of Perplexity for Zero-shot Commonsense Reasoning

A novel commonsense reasoning metric, Non-Replacement Confidence (NRC), which operates on PLMs pre-trained with the Replaced Token Detection (RTD) objective, as in ELECTRA, and shows that pre-endowed commonsense knowledge, especially for RTD-based PLMs, is essential for downstream reasoning.

WANLI: Worker and AI Collaboration for Natural Language Inference Dataset Creation

This work introduces a novel approach for dataset creation based on worker and AI collaboration, which brings together the generative strength of language models and the evaluative power of humans.

Human Parity on CommonsenseQA: Augmenting Self-Attention with External Attention

It is found that the proposed external attention mechanism can significantly improve the performance of existing AI systems, allowing practitioners to easily customize foundation AI models to many diverse downstream applications.

Great Truths are Always Simple: A Rather Simple Knowledge Encoder for Enhancing the Commonsense Reasoning Capacity of Pre-Trained Models

A deep empirical analysis finds that it is relation features from CSKGs (not node features) that mainly contribute to the performance improvement of PTMs, and accordingly designs a simple MLP-based knowledge encoder that utilizes statistical relation paths as features.

Unifying Language Learning Paradigms

UL2 achieves SOTA performance on 50 well-established supervised NLP tasks, spanning language generation, language understanding, text classification, question answering, commonsense reasoning, long-text reasoning, structured knowledge grounding, and information retrieval.

Are AI systems biased against the poor? A machine learning analysis using Word2Vec and GloVe embeddings

Among the myriad technical approaches and abstract guidelines proposed on the topic of AI bias, there has been an urgent call to translate the principle of fairness into operational AI reality.

Elaboration-Generating Commonsense Question Answering at Scale

This work uses smaller language models to generate useful intermediate context, referred to here as elaborations, and alternates between updating two language models—an elaboration generator and an answer predictor—allowing each to influence the other.

Symbolic Knowledge Distillation: from General Language Models to Commonsense Models

It is demonstrated that careful prompt engineering and a separately trained critic model allow us to selectively distill high-quality causal commonsense from GPT-3, a general language model, and results in a neural commonsense model that surpasses the teacher model's commonsense capabilities despite its 100x smaller size.

On Reality and the Limits of Language Data

The objective of this work is to explore how far language data alone can enable computers to understand necessary truths about the physical world, using a novel and tightly controlled reasoning test, and to highlight what models might learn directly from pure linguistic data.

Benchmarking GPT-3 For Closed-Book QA: Strengths and Weaknesses

This work thoroughly evaluates GPT-3 on the task of closed-book question answering (CBQA), aiming to gain a better understanding of the strengths and weaknesses of GPT-3 in terms of different reasoning capabilities.

SWAG: A Large-Scale Adversarial Dataset for Grounded Commonsense Inference

This paper introduces the task of grounded commonsense inference, unifying natural language inference and commonsense reasoning, and proposes Adversarial Filtering (AF), a novel procedure that constructs a de-biased dataset by iteratively training an ensemble of stylistic classifiers, and using them to filter the data.
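The filtering loop described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the classifier-training callback, the feature representation, and the train/held-out split ratio are all placeholders, and the real AF procedure operates on multiple-choice endings rather than generic labeled examples.

```python
import random
from typing import Callable, List, Tuple

# An example is a (feature-vector, label) pair; a trained classifier is a
# function from feature vector to predicted label.
Example = Tuple[list, int]
Predictor = Callable[[list], int]

def adversarial_filter(
    examples: List[Example],
    train_classifier: Callable[[List[Example]], Predictor],
    n_iters: int = 5,
    holdout_frac: float = 0.5,
) -> List[Example]:
    """Iteratively drop held-out examples a freshly trained classifier
    answers correctly, so the surviving set is hard for that classifier
    family (a rough sketch of adversarial filtering)."""
    kept = list(examples)
    for _ in range(n_iters):
        random.shuffle(kept)
        split = int(len(kept) * (1.0 - holdout_frac))
        train, heldout = kept[:split], kept[split:]
        predict = train_classifier(train)
        # From the held-out portion, keep only the examples the
        # classifier got wrong (the "hard" ones); the training portion
        # is carried over unfiltered to be re-split next round.
        hard = [ex for ex in heldout if predict(ex[0]) != ex[1]]
        kept = train + hard
    return kept
```

In the real procedure the "stylistic classifier" is an ensemble of learned models; here any callable that trains on the given split and returns a predictor can be plugged in, and examples exhibiting a surface artifact the classifier can exploit are progressively eliminated.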

HellaSwag: Can a Machine Really Finish Your Sentence?

The construction of HellaSwag, a new challenge dataset, and its resulting difficulty, sheds light on the inner workings of deep pretrained models, and suggests a new path forward for NLP research, in which benchmarks co-evolve with the evolving state-of-the-art in an adversarial way, so as to present ever-harder challenges.

Beat the AI: Investigating Adversarial Human Annotation for Reading Comprehension

This work investigates this annotation methodology and applies it in three different settings, collecting a total of 36,000 samples with progressively stronger models in the annotation loop, finding that stronger models can still learn from datasets collected with substantially weaker models in the loop.

Fool Me Twice: Entailment from Wikipedia Gamification

FoolMeTwice (FM2 for short) is a large dataset of challenging entailment pairs collected through a fun multi-player game; the game leads to diverse strategies for crafting claims, such as temporal inference and diverting to unrelated evidence, and results in higher-quality data for the entailment and evidence retrieval tasks.

oLMpics-On What Language Model Pre-training Captures

This work proposes eight reasoning tasks, which conceptually require operations such as comparison, conjunction, and composition; its findings can help future work on designing new datasets, models, and objective functions for pre-training.

Beat the Machine: Challenging Humans to Find a Predictive Model's “Unknown Unknowns”

This article presents a system that, in a game-like setting, asks humans to identify cases that will cause the predictive model-based system to fail, and shows that the humans using Beat the Machine identify more errors than do traditional techniques for discovering errors in predictive models.

WINOGRANDE: An Adversarial Winograd Schema Challenge at Scale

This work introduces WinoGrande, a large-scale dataset of 44k problems, inspired by the original WSC design, but adjusted to improve both the scale and the hardness of the dataset, and establishes new state-of-the-art results on five related benchmarks.

Adversarial Filters of Dataset Biases

This work presents extensive supporting evidence that AFLite is broadly applicable for reduction of measurable dataset biases, and that models trained on the filtered datasets yield better generalization to out-of-distribution tasks.

Semantically Equivalent Adversarial Rules for Debugging NLP models

This work presents semantically equivalent adversaries (SEAs), semantics-preserving perturbations that induce changes in the model's predictions, and generalizes them into semantically equivalent adversarial rules (SEARs) that induce adversaries on many semantically similar instances.

Adversarial NLI: A New Benchmark for Natural Language Understanding

This work introduces a new large-scale NLI benchmark dataset, collected via an iterative, adversarial human-and-model-in-the-loop procedure, and shows that non-expert annotators are successful at finding model weaknesses.