Corpus ID: 233289738

Back to Square One: Bias Detection, Training and Commonsense Disentanglement in the Winograd Schema

Yanai Elazar, Hongming Zhang, Yoav Goldberg, D. Roth
The Winograd Schema (WS) has been proposed as a test for measuring the commonsense capabilities of models. Recently, approaches based on pre-trained language models have boosted performance on some WS benchmarks, but the source of the improvement is still not clear. We begin by showing that the current evaluation method of WS is sub-optimal and propose a modification that makes use of twin sentences for evaluation. We also propose two new baselines that indicate the existence of biases in WS benchmarks…
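As a rough illustration of what evaluating on twin sentences could look like, the sketch below counts a twin pair as correct only when both of its sentences are answered correctly, and compares that with ordinary per-sentence accuracy. The pairing rule, the input format, and the function names are assumptions made for illustration; the abstract above does not spell out the exact metric.

```python
from collections import defaultdict


def single_accuracy(results):
    """Per-sentence accuracy over (pair_id, is_correct) tuples (hypothetical format)."""
    results = list(results)
    if not results:
        return 0.0
    return sum(bool(ok) for _, ok in results) / len(results)


def paired_accuracy(results):
    """Assumed pair-level scoring: a pair counts only if every twin is correct."""
    groups = defaultdict(list)
    for pair_id, ok in results:
        groups[pair_id].append(bool(ok))
    if not groups:
        return 0.0
    return sum(all(flags) for flags in groups.values()) / len(groups)


# Toy predictions for two twin pairs; the identifiers are made up.
results = [
    ("pair-1", True),
    ("pair-1", False),  # the model gets only one twin right
    ("pair-2", True),
    ("pair-2", True),
]

print(single_accuracy(results))  # 0.75
print(paired_accuracy(results))  # 0.5
```

Under this assumed rule, a model that guesses the same answer for both twins gets no credit for the pair, which is why the pair-level score can be much lower than the per-sentence score.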


Attention-based Contrastive Learning for Winograd Schemas
Self-supervised learning has recently attracted considerable attention in the NLP community for its ability to learn discriminative features using a contrastive objective (Qu et al., 2020; Klein and…
Dimensions of Commonsense Knowledge
This paper surveys a wide range of popular commonsense sources with a special focus on their relations, and consolidates these relations into 13 knowledge dimensions, each abstracting over more specific relations found in sources.
Exploring Strategies for Generalizable Commonsense Reasoning with Pre-trained Models
Commonsense reasoning benchmarks have been largely solved by fine-tuning language models. The downside is that fine-tuning may cause models to overfit to task-specific data and thereby forget their…
Towards Zero-shot Commonsense Reasoning with Self-supervised Refinement of Language Models
Can we get existing language models and refine them for zero-shot commonsense reasoning? This paper presents an initial study exploring the feasibility of zero-shot commonsense reasoning for the…


ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
This work presents two parameter-reduction techniques to lower memory consumption and increase the training speed of BERT, and uses a self-supervised loss that focuses on modeling inter-sentence coherence.
WINOGRANDE: An Adversarial Winograd Schema Challenge at Scale
This work introduces WinoGrande, a large-scale dataset of 44k problems, inspired by the original WSC design, but adjusted to improve both the scale and the hardness of the dataset, and establishes new state-of-the-art results on five related benchmarks.
Evaluating commonsense in pretrained language models
AAAI, pages 9733–9740, 2020.
Learning the Difference that Makes a Difference with Counterfactually-Augmented Data
This paper focuses on natural language processing, introducing methods and resources for training models that are less sensitive to spurious patterns, and tasks humans with revising each document so that it accords with a counterfactual target label and retains internal coherence.
Precise Task Formalization Matters in Winograd Schema Evaluations
This work performs an ablation on two Winograd Schema datasets that interpolates between the formalizations used before and after this surge, and finds that framing the task as multiple choice improves performance by 2-6 points and that several additional techniques can mitigate the model's extreme sensitivity to hyperparameters.
The Sensitivity of Language Models and Humans to Winograd Schema Perturbations
Results highlight interesting differences between humans and language models: language models are more sensitive to number or gender alternations and synonym replacements than humans, and humans are more stable and consistent in their predictions, maintain a much higher absolute performance, and perform better on non-associative instances than associative ones.
How Reasonable are Common-Sense Reasoning Tasks: A Case-Study on the Winograd Schema Challenge and SWAG
This paper presents case studies of both benchmarks and designs protocols that clarify and qualify the results of previous work by analyzing threats to the validity of previous experimental designs.
RoBERTa: A Robustly Optimized BERT Pretraining Approach
It is found that BERT was significantly undertrained and can match or exceed the performance of every model published after it; the best model achieves state-of-the-art results on GLUE, RACE and SQuAD.
The Winograd Schema Challenge
This paper presents an alternative to the Turing Test that has some conceptual and practical advantages: English-speaking adults will have no difficulty with it, and the subject is not required to engage in a conversation and fool an interrogator into believing she is dealing with a person.
An Analysis of Dataset Overlap on Winograd-Style Tasks
It is found that a large number of test instances overlap considerably with the pretraining corpora on which state-of-the-art models are trained, and that a significant drop in classification accuracy occurs when models are evaluated on instances with minimal overlap.