Social IQA: Commonsense Reasoning about Social Interactions

Maarten Sap, Hannah Rashkin, Derek Chen, Ronan Le Bras, Yejin Choi
EMNLP 2019

We introduce Social IQa, the first large-scale benchmark for commonsense reasoning about social situations. Through crowdsourcing, we collect commonsense questions along with correct and incorrect answers about social interactions, using a new framework that mitigates stylistic artifacts in incorrect answers by asking workers to provide the right answer to a different but related question. Empirical results show that our benchmark is challenging for existing question-answering models based on…

RiddleSense: Answering Riddle Questions as Commonsense Reasoning

RIDDLESENSE is proposed: a novel multiple-choice question answering challenge for benchmarking higher-order commonsense reasoning models and the first large dataset for riddle-style commonsense question answering, with distractors crowdsourced from human annotators.

RiddleSense: Reasoning about Riddle Questions Featuring Linguistic Creativity and Commonsense Knowledge

RIDDLESENSE, a new multiple-choice question answering task, is presented, which comes with the first large dataset (5.7k examples) for answering riddle-style commonsense questions; the authors point out a large gap between the best supervised model and human performance.

COM2SENSE: A Commonsense Reasoning Benchmark with Complementary Sentences

This work introduces a new commonsense reasoning benchmark dataset comprising natural language true/false statements, with each sample paired with its complementary counterpart, resulting in 4k sentence pairs, and proposes a pairwise accuracy metric to reliably measure an agent’s ability to perform commonsense reasoning over a given situation.

PIQA: Reasoning about Physical Commonsense in Natural Language

The task of physical commonsense reasoning and a corresponding benchmark dataset, Physical Interaction: Question Answering (PIQA), are introduced, and an analysis of the dimensions of knowledge that existing models lack is provided, which offers significant opportunities for future research.

Towards Generative Commonsense Reasoning: A Concept Paper

This concept paper discusses why large pre-trained language encoders like BERT can easily achieve state-of-the-art performance on multi-choice question answering datasets, and advocates evaluating machine commonsense reasoning ability through controlled language generation.

Go Beyond Plain Fine-Tuning: Improving Pretrained Models for Social Commonsense

This study focuses on the Social IQA dataset, a task requiring social and emotional commonsense reasoning, and proposes several architecture variations and extensions, and also leverages external commonsense corpora to optimize the model for Social IQA.

Commonsense-Focused Dialogues for Response Generation: An Empirical Study

This paper auto-extracts commonsensical dialogues from existing dialogue datasets by leveraging ConceptNet, a commonsense knowledge graph, and proposes an approach for automatic evaluation of commonsense that relies on features derived from ConceptNet and pre-trained language and dialog models and shows reasonable correlation with human evaluation of responses’ commonsense quality.

A Semantic-based Method for Unsupervised Commonsense Question Answering

This paper presents a novel SEmantic-based Question Answering method (SEQA) for unsupervised commonsense question answering that first generates a set of plausible answers with generative models, and then uses these plausible answers to select the correct choice by considering the semantic similarity between each plausible answer and each choice.

Semantic Categorization of Social Knowledge for Commonsense Question Answering

This work proposes categorizing the semantics needed for these QA tasks, using SocialIQA as an example, and further trains neural QA models to incorporate such social knowledge categories and relation information from a knowledge base.

Prompting Contrastive Explanations for Commonsense Reasoning Tasks

Inspired by the contrastive nature of human explanations, large pretrained language models are used to complete explanation prompts which contrast alternatives according to the key attribute(s) required to justify the correct answer.



CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge

This work presents CommonsenseQA: a challenging new dataset for commonsense question answering, which extracts from ConceptNet multiple target concepts that have the same semantic relation to a single source concept.

SWAG: A Large-Scale Adversarial Dataset for Grounded Commonsense Inference

This paper introduces the task of grounded commonsense inference, unifying natural language inference and commonsense reasoning, and proposes Adversarial Filtering (AF), a novel procedure that constructs a de-biased dataset by iteratively training an ensemble of stylistic classifiers, and using them to filter the data.

Know What You Don’t Know: Unanswerable Questions for SQuAD

SQuADRUn is a new dataset that combines the existing Stanford Question Answering Dataset (SQuAD) with over 50,000 unanswerable questions written adversarially by crowdworkers to look similar to answerable ones.

From Recognition to Cognition: Visual Commonsense Reasoning

To move towards cognition-level understanding, a new reasoning engine is presented, Recognition to Cognition Networks (R2C), that models the necessary layered inferences for grounding, contextualization, and reasoning.

ATOMIC: An Atlas of Machine Commonsense for If-Then Reasoning

Experimental results demonstrate that multitask models that incorporate the hierarchical structure of if-then relation types lead to more accurate inference compared to models trained in isolation, as measured by both automatic and human evaluation.


It is shown how the combined strength and wisdom of the crowds can be used to generate a large, high‐quality, word–emotion and word–polarity association lexicon quickly and inexpensively.

SQuAD: 100,000+ Questions for Machine Comprehension of Text

A strong logistic regression model is built, which achieves an F1 score of 51.0%, a significant improvement over a simple baseline (20%).

HellaSwag: Can a Machine Really Finish Your Sentence?

The construction of HellaSwag, a new challenge dataset, and its resulting difficulty, sheds light on the inner workings of deep pretrained models, and suggests a new path forward for NLP research, in which benchmarks co-evolve with the evolving state-of-the-art in an adversarial way, so as to present ever-harder challenges.

Tackling the Story Ending Biases in The Story Cloze Test

A new crowdsourcing scheme is designed to create a new SCT dataset that overcomes some of the biases, and a few models are benchmarked on the new dataset, showing that the top-performing model on the original SCT datasets fails to keep up its performance.

Commonsense Causal Reasoning between Short Texts

A framework that automatically harvests a network of causal-effect terms from a large web corpus is proposed that outperforms all previously reported results in the standard SEMEVAL COPA task by substantial margins.