An Analysis of Dataset Overlap on Winograd-Style Tasks

Ali Emami, Adam Trischler, Kaheer Suleman, Jackie Chi Kit Cheung
The Winograd Schema Challenge (WSC) and variants inspired by it have become important benchmarks for common-sense reasoning (CSR). Model performance on the WSC has quickly progressed from chance-level to near-human using neural language models trained on massive corpora. In this paper, we analyze the effects of varying degrees of overlap between these corpora and the test instances in WSC-style tasks. We find that a large number of test instances overlap considerably with the…


Unreasonable Effectiveness of Rule-Based Heuristics in Solving Russian SuperGLUE Tasks

This paper shows that the Russian SuperGLUE (RSG), a recently published benchmark set and leaderboard for Russian natural language understanding, is vulnerable to shallow heuristics and provides a set of recommendations on how to improve these datasets, making the RSG leaderboard even more representative of the real progress in Russian NLU.

PCR4ALL: A Comprehensive Evaluation Benchmark for Pronoun Coreference Resolution in English

This work proposes PCR4ALL, a new benchmark and toolbox that evaluates and analyzes the performance of PCR systems from different perspectives (i.e., knowledge source, domain, data size, frequency, relevance, and polarity), in the hope that PCR4ALL can motivate the community to pay more attention to solving the overall PCR problem and to understanding model performance comprehensively.

KARaML: Integrating Knowledge-Based and Machine Learning Approaches to Solve the Winograd Schema Challenge

KARaML, a novel asymmetric method for integrating knowledge-based and machine learning approaches to tackle the Winograd Schema Challenge, uses relational representations of natural language sentences and defines high-level patterns encoded in Answer Set Programming to identify relationships between entities based on their semantic roles.

Back to Square One: Artifact Detection, Training and Commonsense Disentanglement in the Winograd Schema

It is suggested that the apparent progress on WS may not reflect genuine progress in commonsense reasoning, and that the observed gains are mostly due to the use of supervision in training WS models, which is unlikely to support all the required commonsense reasoning skills and knowledge.

Semantic Analysis of Winograd Schema No. 1

The Winograd Schema Challenge is a general test for Artificial Intelligence based on problems of pronoun reference resolution. I investigate the semantics and interpretation of Winograd Schemas.

GooAQ: Open Question Answering with Diverse Answer Types

GOOAQ is presented, a large-scale dataset collected from Google questions and answers, containing 3 million questions with diverse answer types ranging from factual short answers to snippets to collections; it is shown that 94% of the mined answers are accurate, enabling fine-tuning of a pre-trained language model for answering GOOAQ questions.

Measuring Causal Effects of Data Statistics on Language Model's 'Factual' Predictions

A causal framework provides a language for describing how training data causes a model to make a certain prediction; the work demonstrates the importance of studying datasets and the benefits of causality for understanding NLP models.

Changing the World by Changing the Data

This position paper maps out the arguments for and against data curation, and argues that fundamentally the point is moot: curation already is and will be happening, and it is changing the world.

Language models show human-like content effects on reasoning

This work hypothesized that language models would show human-like content effects on abstract reasoning problems, and explored this hypothesis across three logical reasoning tasks: natural language inference, judging the logical validity of syllogisms, and the Wason selection task.

A Surprisingly Robust Trick for the Winograd Schema Challenge

This paper shows that the performance of three language models on WSC273 strongly improves when fine-tuned on a similar pronoun disambiguation problem dataset (denoted WSCR), and generates a large unsupervised WSC-like dataset.

Exploring Unsupervised Pretraining and Sentence Structure Modelling for Winograd Schema Challenge

It is demonstrated that the leading performance benefits from jointly modelling sentence structures, utilizing knowledge learned from cutting-edge pretraining models, and performing fine-tuning.

A Simple Method for Commonsense Reasoning

Key to this method is the use of language models, trained on a massive amount of unlabeled data, to score multiple-choice questions posed by commonsense reasoning tests; the method outperforms previous state-of-the-art methods by a large margin.

GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding

A benchmark of nine diverse NLU tasks, an auxiliary dataset for probing models' understanding of specific linguistic phenomena, and an online platform for evaluating and comparing models; the benchmark favors models that can represent linguistic knowledge in a way that facilitates sample-efficient learning and effective knowledge transfer across tasks.

Resolving Complex Cases of Definite Pronouns: The Winograd Schema Challenge

A knowledge-rich approach to the task of resolving complex cases of definite pronouns is employed, which yields a pronoun resolver that outperforms state-of-the-art resolvers by nearly 18 points in accuracy on the authors' dataset.

WINOGRANDE: An Adversarial Winograd Schema Challenge at Scale

This work introduces WinoGrande, a large-scale dataset of 44k problems, inspired by the original WSC design, but adjusted to improve both the scale and the hardness of the dataset, and establishes new state-of-the-art results on five related benchmarks.

The Sensitivity of Language Models and Humans to Winograd Schema Perturbations

Results highlight interesting differences between humans and language models: language models are more sensitive to number or gender alternations and synonym replacements than humans, and humans are more stable and consistent in their predictions, maintain a much higher absolute performance, and perform better on non-associative instances than associative ones.

The KnowRef Coreference Corpus: Removing Gender and Number Cues for Difficult Pronominal Anaphora Resolution

KnowRef, a new benchmark for coreference resolution and NLI that targets common-sense understanding and world knowledge, is introduced, and a data-augmentation trick called antecedent switching is proposed to alleviate models' over-reliance on particular candidate antecedents.

Towards Addressing the Winograd Schema Challenge - Building and Using a Semantic Parser and a Knowledge Hunting Module

This paper presents an approach that identifies the knowledge needed to answer a challenge question, hunts down that knowledge from text repositories, and then reasons with it to come up with the answer.

An Example-Based Approach to Difficult Pronoun Resolution

A method for automatically acquiring examples that are similar to Winograd schemas but have less ambiguity is presented; the experimental results show that existing sentences on the Web indeed contain instances of world knowledge useful for difficult pronoun resolution.