Back to Square One: Bias Detection, Training and Commonsense Disentanglement in the Winograd Schema

Yanai Elazar, Hongming Zhang, Yoav Goldberg, Dan Roth
The Winograd Schema (WS) has been proposed as a test for measuring commonsense capabilities of models. Recently, pre-trained language model-based approaches have boosted performance on some WS benchmarks but the source of improvement is still not clear. This paper suggests that the apparent progress on WS may not necessarily reflect progress in commonsense reasoning. To support this claim, we first show that the current evaluation method of WS is sub-optimal and propose a modification that uses… 

The Defeat of the Winograd Schema Challenge
This paper reviews the history of the Winograd Schema Challenge and notes that a number of AI systems, based on large pre-trained transformer-based language models and fine-tuned on these kinds of problems, achieved better than 90% accuracy.
A Systematic Investigation of Commonsense Understanding in Large Language Models
It is found that the impressive zero-shot performance of large language models is mostly due to the existence of dataset bias in the authors' benchmarks, and that leveraging explicit commonsense knowledge does not yield substantial improvement.
ASER: Towards Large-scale Commonsense Knowledge Acquisition via Higher-order Selectional Preference over Eventualities
The definition of selectional preference is generalized from one-hop linguistic syntactic relations to higher-order relations over linguistic graphs, and a large-scale eventuality-based knowledge graph, ASER, is developed (eventuality being a linguistic term covering activity, state, and event), where each eventuality is represented as a dependency graph and the relation between eventualities is a discourse relation defined in shallow discourse parsing.
Attention-based Contrastive Learning for Winograd Schemas
This paper investigates whether contrastive learning can be extended to Transformer attention to tackle the Winograd Schema Challenge, and proposes a novel self-supervised framework, leveraging a contrastive loss directly at the level of self-attention.
Commonsense Knowledge in Word Associations and ConceptNet
An in-depth comparison of two large-scale resources of general knowledge: ConceptNet, an engineered relational database, and SWOW, a knowledge graph derived from crowd-sourced word associations shows empirically that both resources improve downstream task performance on commonsense reasoning benchmarks over text-only baselines.
Dimensions of Commonsense Knowledge
This paper surveys a wide range of popular commonsense sources with a special focus on their relations, and consolidates these relations into 13 knowledge dimensions, each abstracting over more specific relations found in sources.
Exploring Strategies for Generalizable Commonsense Reasoning with Pre-trained Models
The experiments show that fine-tuning performs best, by learning both the content and the structure of the task, but suffers from overfitting and limited generalization to novel answers, while alternative adaptation methods like prefix-tuning have comparable accuracy, but generalize better to unseen answers and are more robust to adversarial splits.
Towards Zero-shot Commonsense Reasoning with Self-supervised Refinement of Language Models
An initial study exploring the feasibility of zero-shot commonsense reasoning for the Winograd Schema Challenge by formulating the task as self-supervised refinement of a pre-trained language model utilizing a set of linguistic perturbations of similar concept relationships.


ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
This work presents two parameter-reduction techniques to lower memory consumption and increase the training speed of BERT, and uses a self-supervised loss that focuses on modeling inter-sentence coherence.
Evaluating commonsense in pretrained language models
  • AAAI, pages 9733–9740.
  • 2020
Learning the Difference that Makes a Difference with Counterfactually-Augmented Data
This paper focuses on natural language processing, introducing methods and resources for training models less sensitive to spurious patterns, and tasks humans with revising each document so that it accords with a counterfactual target label and retains internal coherence.
Precise Task Formalization Matters in Winograd Schema Evaluations
This work performs an ablation on two Winograd Schema datasets that interpolates between the formalizations used before and after this surge, and finds framing the task as multiple choice improves performance by 2-6 points and several additional techniques can mitigate the model's extreme sensitivity to hyperparameters.
The Sensitivity of Language Models and Humans to Winograd Schema Perturbations
Results highlight interesting differences between humans and language models: language models are more sensitive to number or gender alternations and synonym replacements than humans, and humans are more stable and consistent in their predictions, maintain a much higher absolute performance, and perform better on non-associative instances than associative ones.
WINOGRANDE: An Adversarial Winograd Schema Challenge at Scale
This work introduces WinoGrande, a large-scale dataset of 44k problems, inspired by the original WSC design, but adjusted to improve both the scale and the hardness of the dataset, and establishes new state-of-the-art results on five related benchmarks.
How Reasonable are Common-Sense Reasoning Tasks: A Case-Study on the Winograd Schema Challenge and SWAG
This paper makes case studies of both benchmarks and design protocols that clarify and qualify the results of previous work by analyzing threats to the validity of previous experimental designs.
RoBERTa: A Robustly Optimized BERT Pretraining Approach
It is found that BERT was significantly undertrained, and can match or exceed the performance of every model published after it, and the best model achieves state-of-the-art results on GLUE, RACE and SQuAD.
The Winograd Schema Challenge
This paper presents an alternative to the Turing Test that has some conceptual and practical advantages: English-speaking adults will have no difficulty with it, and the subject is not required to engage in a conversation and fool an interrogator into believing she is dealing with a person.
Language Models are Few-Shot Learners
GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic.