How Reasonable are Common-Sense Reasoning Tasks: A Case-Study on the Winograd Schema Challenge and SWAG

Authors: Paul Trichelair, Ali Emami, Adam Trischler, Kaheer Suleman, Jackie Chi Kit Cheung
Recent studies have significantly improved the state-of-the-art on common-sense reasoning (CSR) benchmarks like the Winograd Schema Challenge (WSC) and SWAG. The question we ask in this paper is whether improved performance on these benchmarks represents genuine progress towards common-sense-enabled systems. We make case studies of both benchmarks and design protocols that clarify and qualify the results of previous work by analyzing threats to the validity of previous experimental designs. Our… 

KARaML: Integrating Knowledge-Based and Machine Learning Approaches to Solve the Winograd Schema Challenge

KARaML, a novel asymmetric method for integrating knowledge-based and machine learning approaches to tackle the Winograd Schema Challenge, uses relational representations of natural language sentences and defines high-level patterns encoded in Answer Set Programming to identify relationships between entities based on their semantic roles.

Tackling Domain-Specific Winograd Schemas with Knowledge-Based Reasoning and Machine Learning

This paper proposes an ensemble method that combines knowledge-based reasoning and machine learning and performs best in the experiments, along with a keyword method for defining a restricted domain in which distinctive high-level semantic patterns can be found.

WinoLogic: A Zero-Shot Logic-based Diagnostic Dataset for Winograd Schema Challenge

This paper proposes a logic-based framework focused on high-quality commonsense knowledge: it identifies and collects formal knowledge formulas verified by theorem provers, translates them into natural language sentences, and builds a new dataset, WinoLogic, from these sentences.

Investigating associative, switchable and negatable Winograd items on renewed French data sets

This work updates the existing French data set and creates three subsets that allow for a more robust, fine-grained evaluation protocol of WSC in French; it also shows that the higher performance could be explained by the existence of associative items in FWSC.

An Analysis of Dataset Overlap on Winograd-Style Tasks

It is found that a large number of test instances overlap considerably with the pretraining corpora on which state-of-the-art models are trained, and that a significant drop in classification accuracy occurs when models are evaluated on instances with minimal overlap.

Back to Square One: Artifact Detection, Training and Commonsense Disentanglement in the Winograd Schema

It is suggested that the apparent progress on WS may not necessarily reflect progress in commonsense reasoning, and that the observed progress is mostly due to the use of supervision in training WS models, which is not likely to successfully support all the required commonsense reasoning skills and knowledge.

Semantic Analysis of Winograd Schema No. 1

The Winograd Schema Challenge is a general test for Artificial Intelligence, based on problems of pronoun reference resolution. I investigate the semantics and interpretation of Winograd Schemas,

Towards Zero-shot Commonsense Reasoning with Self-supervised Refinement of Language Models

This paper proposes a novel self-supervised learning approach that refines the language model using a set of linguistic perturbations of similar concept relationships, and demonstrates the viability of zero-shot commonsense reasoning on multiple benchmarks.

Social Commonsense Reasoning with Multi-Head Knowledge Attention

This work proposes a novel multi-head knowledge attention model that encodes semi-structured commonsense inference rules and learns to incorporate them in a transformer-based reasoning cell, and is the first to demonstrate that a model that learns to perform counterfactual reasoning helps predict the best explanation in an abductive reasoning task.

On Reality and the Limits of Language Data

The objective of this work is to explore how far language data alone can enable computers to understand the necessary truth about the physical world, using a novel and tightly controlled reasoning test, and to highlight what models might learn directly from pure linguistic data.



A Knowledge Hunting Framework for Common Sense Reasoning

An automatic system that achieves state-of-the-art results on the Winograd Schema Challenge (WSC), a common sense reasoning task that requires diverse, complex forms of inference and knowledge, using a knowledge hunting module to gather text from the web to serve as evidence for candidate problem resolutions.

Resolving Complex Cases of Definite Pronouns: The Winograd Schema Challenge

A knowledge-rich approach to the task of resolving complex cases of definite pronouns is employed, which yields a pronoun resolver that outperforms state-of-the-art resolvers by nearly 18 points in accuracy on the authors' dataset.

SemEval-2012 Task 7: Choice of Plausible Alternatives: An Evaluation of Commonsense Causal Reasoning

The two systems that competed in this task as part of SemEval-2012 are described, and their results are compared to those achieved in previously published research.

A Simple Method for Commonsense Reasoning

Key to this method is the use of language models, trained on a massive amount of unlabeled data, to score multiple choice questions posed by commonsense reasoning tests, which outperform previous state-of-the-art methods by a large margin.

GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding

A benchmark of nine diverse NLU tasks, an auxiliary dataset for probing models for understanding of specific linguistic phenomena, and an online platform for evaluating and comparing models, which favors models that can represent linguistic knowledge in a way that facilitates sample-efficient learning and effective knowledge-transfer across tasks.

Textual Inference: getting logic from humans

The investigation showed that the human judgements used in the building of the SICK corpus can be erroneous, in this way deteriorating the quality of an otherwise useful resource.

The Winograd Schema Challenge

This paper presents an alternative to the Turing Test that has some conceptual and practical advantages: English-speaking adults will have no difficulty with it, and the subject is not required to engage in a conversation and fool an interrogator into believing she is dealing with a person.

SWAG: A Large-Scale Adversarial Dataset for Grounded Commonsense Inference

This paper introduces the task of grounded commonsense inference, unifying natural language inference and commonsense reasoning, and proposes Adversarial Filtering (AF), a novel procedure that constructs a de-biased dataset by iteratively training an ensemble of stylistic classifiers, and using them to filter the data.

Social IQA: Commonsense Reasoning about Social Interactions

It is established that Social IQa, the first large-scale benchmark for commonsense reasoning about social situations, is challenging for existing question-answering models based on pretrained language models, compared to human performance (>20% gap).

Annotation Artifacts in Natural Language Inference Data

It is shown that a simple text categorization model can correctly classify the hypothesis alone in about 67% of SNLI and 53% of MultiNLI, and that specific linguistic phenomena such as negation and vagueness are highly correlated with certain inference classes.