UNICORN on RAINBOW: A Universal Commonsense Reasoning Model on a New Multitask Benchmark

Nicholas Lourie, Ronan Le Bras, Chandra Bhagavatula, Yejin Choi
Commonsense AI has long been seen as a near-impossible goal, until recently. Now, research interest has sharply increased with an influx of new benchmarks and models. We propose two new ways to evaluate commonsense models, emphasizing their generality on new tasks and building on diverse, recently introduced benchmarks. First, we propose a new multitask benchmark, Rainbow, to promote research on commonsense models that generalize well over multiple tasks and datasets. Second, we propose a…

Shortcutted Commonsense: Data Spuriousness in Deep Learning of Commonsense Reasoning

A study of several prominent benchmarks that involve commonsense reasoning, along with a number of key stress experiments, seeking insight into whether models learn transferable generalizations intrinsic to the problem at stake or merely exploit incidental shortcuts in the data items.

Analyzing Commonsense Emergence in Few-shot Knowledge Models

The results show that commonsense knowledge models can rapidly adapt from limited examples, indicating that KG fine-tuning serves to learn an interface to encoded knowledge learned during pretraining.

An Empirical Investigation of Commonsense Self-Supervision with Knowledge Graphs

The effect of different synthetic datasets on language models with various architectures and sizes is studied to show that encoder-decoder models benefit from more data to learn from, whereas sampling strategies that balance across different aspects yield best performance.

Modularized Transfer Learning with Multiple Knowledge Graphs for Zero-shot Commonsense Reasoning

This paper proposes to mitigate the loss of knowledge from the interference among the different knowledge sources, by developing a modular variant of the knowledge aggregation as a new zero-shot commonsense reasoning framework.

Do Language Models Learn Commonsense Knowledge?

Language models (LMs) trained on large amounts of data (e.g., Brown et al., 2020; Patwary et al., 2021) have shown impressive performance on many NLP tasks under the zero-shot and few-shot setup.

NumGLUE: A Suite of Fundamental yet Challenging Mathematical Reasoning Tasks

NumGLUE, a multi-task benchmark that evaluates the performance of AI systems on eight different tasks that at their core require simple arithmetic understanding, is proposed, and it is shown that this benchmark is far from solved, with neural models, including state-of-the-art large-scale language models, performing significantly worse than humans.

SPoT: Better Frozen Model Adaptation through Soft Prompt Transfer

It is shown that SPoT significantly boosts the performance of Prompt Tuning across many tasks, and an efficient retrieval approach is proposed that interprets task prompts as task embeddings to identify similar tasks and predict the most transferable source tasks for a novel target task.

ExT5: Towards Extreme Multi-Task Scaling for Transfer Learning

ExMix (Extreme Mixture), a massive collection of 107 supervised NLP tasks across diverse domains and task families, is introduced, and a model pre-trained using a multi-task objective of self-supervised span denoising and supervised ExMix is proposed.

Embarrassingly Simple Performance Prediction for Abductive Natural Language Inference

This work proposes a simple method for predicting the performance of a natural language inference model without actually fine-tuning it, and shows that the accuracy of the cosine similarity approach correlates strongly with the accuracy of the classification approach, with a Pearson correlation coefficient of 0.65.

Semantic Categorization of Social Knowledge for Commonsense Question Answering

This work proposes to categorize the semantics needed for these QA tasks, using SocialIQA as an example, and further trains neural QA models to incorporate such social knowledge categories and relation information from a knowledge base.



A Simple Method for Commonsense Reasoning

Key to this method is the use of language models, trained on a massive amount of unlabeled data, to score multiple-choice questions posed by commonsense reasoning tests; the method outperforms previous state-of-the-art methods by a large margin.

ALBERT: A Lite BERT for Self-supervised Learning of Language Representations

This work presents two parameter-reduction techniques to lower memory consumption and increase the training speed of BERT, and uses a self-supervised loss that focuses on modeling inter-sentence coherence.

CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge

This work presents CommonsenseQA: a challenging new dataset for commonsense question answering, which extracts from ConceptNet multiple target concepts that have the same semantic relation to a single source concept.

Towards Generalizable Neuro-Symbolic Systems for Commonsense Question Answering

This paper performs a survey of recent commonsense QA methods and provides a systematic analysis of popular knowledge resources and knowledge-integration methods, across benchmarks from multiple commonsense datasets, and shows that attention-based injection seems to be a preferable choice for knowledge integration.

Intermediate-Task Transfer Learning with Pretrained Models for Natural Language Understanding: When and Why Does It Work?

It is observed that intermediate tasks requiring high-level inference and reasoning abilities tend to work best and that target task performance is strongly correlated with higher-level abilities such as coreference resolution, but more granular correlations between probing and target task performance are not observed.

BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions

It is found that transferring from entailment data is more effective than transferring from paraphrase or extractive QA data, and that it, surprisingly, continues to be very beneficial even when starting from massive pre-trained language models such as BERT.

SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems

A new benchmark styled after GLUE is presented, comprising a new set of more difficult language understanding tasks, a software toolkit, and a public leaderboard.

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

This systematic study compares pre-training objectives, architectures, unlabeled datasets, transfer approaches, and other factors on dozens of language understanding tasks and achieves state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more.

PIQA: Reasoning about Physical Commonsense in Natural Language

The task of physical commonsense reasoning and a corresponding benchmark dataset, Physical Interaction: Question Answering (PIQA), are introduced, and an analysis of the dimensions of knowledge that existing models lack is provided, which offers significant opportunities for future research.

WINOGRANDE: An Adversarial Winograd Schema Challenge at Scale

This work introduces WinoGrande, a large-scale dataset of 44k problems, inspired by the original WSC design, but adjusted to improve both the scale and the hardness of the dataset, and establishes new state-of-the-art results on five related benchmarks.