Generative Data Augmentation for Commonsense Reasoning

Yiben Yang, Chaitanya Malaviya, Jared Fernandez, Swabha Swayamdipta, Ronan Le Bras, Ji-ping Wang, Chandra Bhagavatula, Yejin Choi, and Doug Downey. Findings of the Association for Computational Linguistics: EMNLP 2020.

Recent advances in commonsense reasoning depend on large-scale human-annotated training data to achieve peak performance. However, manual curation of training examples is expensive and has been shown to introduce annotation artifacts that neural models can readily exploit and overfit to. We investigate G-DAUG^C, a novel generative data augmentation method that aims to achieve more accurate and robust learning in the low-resource setting. Our approach generates synthetic examples using…

A Survey on Data Augmentation for Text Classification

This survey is concerned with data augmentation methods for textual classification and aims to provide a concise and comprehensive overview for researchers and practitioners.

Generate, Annotate, and Learn: NLP with Synthetic Text

GAL achieves new state-of-the-art knowledge distillation results for 6-layer transformers on the GLUE leaderboard; the work investigates the key components of GAL and presents theoretical and empirical arguments against the use of class-conditional LMs.

An Empirical Survey of Data Augmentation for Limited Data Learning in NLP

An empirical survey of recent progress on data augmentation for NLP in the limited labeled data setting is provided, summarizing the landscape of methods and carrying out experiments on 11 datasets covering topics/news classification, inference tasks, paraphrasing tasks, and single-sentence tasks.

Learning to Infer from Unlabeled Data: A Semi-supervised Learning Approach for Robust Natural Language Inference

A novel way to incorporate unlabeled data into semi-supervised learning for NLI is proposed, in which a conditional language model, BART, is used to generate hypotheses for the unlabeled sentences (used as premises).

Counterfactual Data Augmentation via Perspective Transition for Open-Domain Dialogues

Experimental results show that the proposed data augmentation method can augment high-quality responses with different semantics for a given dialogue history, and can outperform competitive baselines on multiple downstream tasks.

Weakly Supervised Data Augmentation Through Prompting for Dialogue Understanding

This work explores few-shot data augmentation for dialogue understanding by prompting large pre-trained language models and presents a novel approach that iterates on augmentation quality by applying weakly-supervised agents.

Leveraging Large Language Models for Multiple Choice Question Answering

It is shown that a model with high MCSB ability performs much better with the natural approach than with the traditional approach across 20 diverse datasets and largely closes the gap with the SOTA, suggesting that the MCQA ability of LLMs has been previously underestimated.

State-of-the-art generalisation research in NLP: a taxonomy and review

A taxonomy for characterising and understanding generalisation research in NLP is presented and used to build a comprehensive map of published generalisation studies, and recommendations are made for which areas might deserve attention in the future.

Reweighting Strategy Based on Synthetic Data Identification for Sentence Similarity

A novel approach is proposed that first trains a classifier to measure the importance of each sentence, which is then used to train a reliable sentence embedding model; experiments demonstrate that the model trained on synthetic data generalizes well and outperforms the baselines.

Low-Resource Dense Retrieval for Open-Domain Question Answering: A Comprehensive Survey

A thorough structured overview of mainstream techniques for low-resource DR is provided, dividing the techniques into three main categories based on their required resources and highlighting the open issues as well as the pros and cons of each category.

PyHessian: Neural Networks Through the Lens of the Hessian

PYHESSIAN, a new scalable framework that enables fast computation of Hessian (i.e., second-order derivative) information for deep neural networks, shows new finer-scale insights, demonstrating that while conventional wisdom is sometimes validated, in other cases it is simply incorrect.

Transformers: State-of-the-art Natural Language Processing

Is BERT Really Robust? Natural Language Attack on Text Classification and Entailment

TextFooler, a general attack framework for generating natural adversarial texts, is presented; it outperforms state-of-the-art attacks in terms of both success rate and perturbation rate.

CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge

This work presents CommonsenseQA: a challenging new dataset for commonsense question answering, which extracts from ConceptNet multiple target concepts that have the same semantic relation to a single source concept.

A large annotated corpus for learning natural language inference

The Stanford Natural Language Inference corpus is introduced, a new, freely available collection of labeled sentence pairs, written by humans doing a novel grounded task based on image captioning, which allows a neural network-based model to perform competitively on natural language inference benchmarks for the first time.

CODAH: An Adversarially-Authored Question Answering Dataset for Common Sense

  • In Proceedings of the 3rd Workshop on Evaluating Vector Space Representations for NLP
  • 2019

WINOGRANDE: An Adversarial Winograd Schema Challenge at Scale

This work introduces WinoGrande, a large-scale dataset of 44k problems, inspired by the original WSC design, but adjusted to improve both the scale and the hardness of the dataset, and establishes new state-of-the-art results on five related benchmarks.

HellaSwag: Can a Machine Really Finish Your Sentence?

The construction of HellaSwag, a new challenge dataset, and its resulting difficulty, sheds light on the inner workings of deep pretrained models, and suggests a new path forward for NLP research, in which benchmarks co-evolve with the evolving state-of-the-art in an adversarial way, so as to present ever-harder challenges.

FreeLB: Enhanced Adversarial Training for Natural Language Understanding

A novel adversarial training algorithm, FreeLB, is proposed that promotes higher invariance in the embedding space by adding adversarial perturbations to word embeddings and minimizing the resultant adversarial risk inside different regions around input samples.

RoBERTa: A Robustly Optimized BERT Pretraining Approach

It is found that BERT was significantly undertrained and, with careful tuning, can match or exceed the performance of every model published after it; the best model achieves state-of-the-art results on GLUE, RACE, and SQuAD.