Corpus ID: 216144451

G-DAug: Generative Data Augmentation for Commonsense Reasoning

@inproceedings{Yang2020GDAugGD,
  title={G-DAug: Generative Data Augmentation for Commonsense Reasoning},
  author={Yiben Yang and Chaitanya Malaviya and Jared Fernandez and Swabha Swayamdipta and Ronan Le Bras and Ji-Ping Wang and Chandra Bhagavatula and Yejin Choi and Doug Downey},
  booktitle={Findings of the Association for Computational Linguistics: EMNLP 2020},
  year={2020}
}
Recent advances in commonsense reasoning depend on large-scale human-annotated training sets to achieve peak performance. However, manual curation of training sets is expensive and has been shown to introduce annotation artifacts that neural models can readily exploit and overfit to. We propose a novel generative data augmentation technique, G-DAUG^C, that aims to achieve more accurate and robust learning in a low-resource setting. Our approach generates synthetic examples using pretrained…
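The abstract is cut off above, but the core recipe it describes (sample candidate training examples from a pretrained language model, then keep only the generations a task model is confident about) can be sketched as follows. This is only a rough illustration under stated assumptions, not the paper's G-DAUG^C pipeline: the GPT-2 and DistilBERT checkpoints, the prompt format, the helper functions, and the confidence threshold are all hypothetical choices.

```python
# Minimal sketch of generative data augmentation (NOT the paper's exact G-DAUG^C pipeline).
# Assumptions: Hugging Face `transformers` is installed; model names, the prompt format,
# and the confidence-based filter are illustrative stand-ins, not taken from the paper.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer, pipeline

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
generator = GPT2LMHeadModel.from_pretrained("gpt2")
generator.eval()

def generate_synthetic_examples(prompt: str, n: int = 5, max_new_tokens: int = 40):
    """Sample n synthetic continuations from a seed prompt with nucleus sampling."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        outputs = generator.generate(
            **inputs,
            do_sample=True,
            top_p=0.95,
            max_new_tokens=max_new_tokens,
            num_return_sequences=n,
            pad_token_id=tokenizer.eos_token_id,
        )
    return [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]

# Stand-in for a task classifier fine-tuned on the original training data
# (here an off-the-shelf SST-2 model, purely so the sketch runs end to end).
task_model = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

def filter_by_confidence(texts, threshold: float = 0.9):
    """Keep generations the task model labels with high confidence; its label becomes the pseudo-label."""
    kept = []
    for text, pred in zip(texts, task_model(texts, truncation=True)):
        if pred["score"] >= threshold:
            kept.append((text, pred["label"]))
    return kept

synthetic = generate_synthetic_examples("Q: Why would someone bring an umbrella to work?\nA:")
augmented_pool = filter_by_confidence(synthetic)
```

A real pipeline would generate full questions, answers, and distractors with task-specific prompts and apply more careful selection than a single confidence cutoff; the sketch only conveys the generate-then-filter shape of the approach.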
GraDA: Graph Generative Data Augmentation for Commonsense Reasoning
TLDR: GraDA, a graph-generative data augmentation framework that synthesizes factual data samples from knowledge graphs for commonsense reasoning datasets, is presented; training with GraDA improves robustness to semantic adversaries, and a human evaluation assesses the factuality and answerability of the synthetic datasets.
Knowledge-driven Data Construction for Zero-shot Evaluation in Commonsense Question Answering
TLDR: A novel neuro-symbolic framework for zero-shot question answering across commonsense tasks is proposed, and it is shown that, while an individual knowledge graph is better suited for specific tasks, a global knowledge graph brings consistent gains across different tasks.
Knowledge-driven Self-supervision for Zero-shot Commonsense Question Answering
TLDR: A novel neuro-symbolic framework for zero-shot question answering across commonsense tasks is proposed, and it is shown that, while an individual knowledge graph is better suited for specific tasks, a global knowledge graph brings consistent gains across different tasks.
Generate, Annotate, and Learn: Generative Models Advance Self-Training and Knowledge Distillation
TLDR: A general framework called “generate, annotate, and learn” (GAL) is presented that uses unconditional generative models to synthesize in-domain unlabeled data, helping advance semi-supervised learning (SSL) and knowledge distillation (KD) on different tasks.
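As a hedged illustration of the "annotate and learn" half of the summary above, the sketch below shows a single knowledge-distillation step in which a teacher soft-labels a batch of synthetic inputs and a student is fitted to those labels. The models, batch format, optimizer, and temperature are placeholders, not GAL's actual setup.

```python
# "Generate, annotate, and learn" in miniature: a teacher annotates synthetic inputs with
# soft labels and a student is trained to match them. Everything here is a placeholder
# sketch, not GAL's published configuration.
import torch
import torch.nn.functional as F

def distillation_step(teacher, student, synthetic_batch, optimizer, temperature=2.0):
    """One knowledge-distillation step on a batch of synthetic (unlabeled) inputs."""
    teacher.eval()
    with torch.no_grad():
        soft_targets = F.softmax(teacher(synthetic_batch) / temperature, dim=-1)
    student_log_probs = F.log_softmax(student(synthetic_batch) / temperature, dim=-1)
    # Standard KD loss: KL between softened teacher and student distributions,
    # scaled by T^2 to keep gradient magnitudes comparable across temperatures.
    loss = F.kl_div(student_log_probs, soft_targets, reduction="batchmean") * temperature ** 2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```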
An Empirical Survey of Data Augmentation for Limited Data Learning in NLP
TLDR: An empirical survey of recent progress on data augmentation for NLP in the limited labeled data setting is provided, summarizing the landscape of methods and carrying out experiments on 11 datasets covering topics/news classification, inference tasks, paraphrasing tasks, and single-sentence tasks.
Generating Data to Mitigate Spurious Correlations in Natural Language Inference Datasets
TLDR: On the majority of the datasets, the method outperforms or performs comparably to previous state-of-the-art debiasing strategies, and when combined with an orthogonal technique, product-of-experts, it improves further and outperforms previous best results on SNLI-hard and MNLI-hard.
SSMBA: Self-Supervised Manifold Based Data Augmentation for Improving Out-of-Domain Robustness
TLDR: This work introduces SSMBA, a data augmentation method for generating synthetic training examples by using a pair of corruption and reconstruction functions to move randomly on a data manifold, and investigates the use of SSMBA in the natural language domain, leveraging the manifold assumption to reconstruct corrupted text with masked language models.
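The corrupt-and-reconstruct idea in the summary above can be sketched with a masked language model: randomly mask some tokens, then let the MLM fill them back in to produce a perturbed but fluent neighbor of the original sentence. The 15% masking rate, the bert-base-uncased checkpoint, and the greedy left-to-right filling below are illustrative assumptions rather than SSMBA's published configuration.

```python
# Corrupt-and-reconstruct augmentation sketch in the spirit of the SSMBA summary above.
# Assumptions: `transformers` is installed; the masking rate and model are illustrative.
import random
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
MASK = fill_mask.tokenizer.mask_token  # "[MASK]" for BERT

def corrupt(text: str, mask_rate: float = 0.15) -> str:
    """Randomly replace a fraction of whitespace-separated tokens with the mask token."""
    words = text.split()
    n_mask = max(1, int(len(words) * mask_rate))
    for i in random.sample(range(len(words)), n_mask):
        words[i] = MASK
    return " ".join(words)

def reconstruct(masked_text: str) -> str:
    """Fill each mask left-to-right with the MLM's top prediction."""
    while MASK in masked_text:
        preds = fill_mask(masked_text)
        # With multiple masks the pipeline returns one candidate list per mask;
        # take the top candidate for the first mask and repeat.
        first = preds[0] if isinstance(preds[0], list) else preds
        masked_text = masked_text.replace(MASK, first[0]["token_str"], 1)
    return masked_text

original = "The chef forgot to turn off the oven before leaving the kitchen."
augmented = reconstruct(corrupt(original))
```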
Automatic Knowledge Augmentation for Generative Commonsense Reasoning
TLDR: A data-centric method is proposed that uses automatic knowledge augmentation to extend commonsense knowledge via a machine knowledge generator, producing semi-golden sentences that improve the generative commonsense reasoning of a language model without architecture modifications.
Generate, Annotate, and Learn: NLP with Synthetic Text
TLDR: GAL achieves new state-of-the-art knowledge distillation results for 6-layer transformers on the GLUE leaderboard and presents theoretical and empirical arguments against the use of class-conditional LMs to generate synthetic labeled text instead of unlabeled text.
On Sample Based Explanation Methods for NLP: Efficiency, Faithfulness, and Semantic Evaluation
TLDR: This work can improve the interpretability of explanations by allowing arbitrary text sequences as the explanation unit, and proposes a semantic-based evaluation metric that can better align with humans’ judgment of explanations than the widely adopted diagnostic or retraining measures.

References

Showing 1-10 of 75 references
SWAG: A Large-Scale Adversarial Dataset for Grounded Commonsense Inference
TLDR: This paper introduces the task of grounded commonsense inference, unifying natural language inference and commonsense reasoning, and proposes Adversarial Filtering (AF), a novel procedure that constructs a de-biased dataset by iteratively training an ensemble of stylistic classifiers and using them to filter the data.
FreeLB: Enhanced Adversarial Training for Natural Language Understanding
TLDR: A novel adversarial training algorithm, FreeLB, is proposed that promotes higher invariance in the embedding space by adding adversarial perturbations to word embeddings and minimizing the resultant adversarial risk inside different regions around input samples.
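A hedged sketch of the general idea of perturbing word embeddings adversarially is given below. It implements a single FGM-style perturbation step, not FreeLB's multi-step "free" adversarial training, and the model interface (a Hugging Face-style `inputs_embeds` argument), the loss function, and epsilon are assumptions.

```python
# One adversarial step on input embeddings (FGM-style sketch, NOT FreeLB's full algorithm).
# `model` is assumed to accept an `inputs_embeds` keyword and return logits; epsilon,
# loss_fn, and the model itself are placeholders.
import torch

def adversarial_embedding_loss(model, embeddings, labels, loss_fn, epsilon=1e-2):
    """Perturb input embeddings along the loss gradient and return the adversarial loss."""
    embeddings = embeddings.detach().requires_grad_(True)
    loss = loss_fn(model(inputs_embeds=embeddings), labels)
    grad, = torch.autograd.grad(loss, embeddings)
    # Normalize the gradient per token and scale by epsilon to stay in a small region.
    perturbation = epsilon * grad / (grad.norm(p=2, dim=-1, keepdim=True) + 1e-12)
    adv_inputs = (embeddings + perturbation).detach()
    return loss_fn(model(inputs_embeds=adv_inputs), labels)
```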
A Simple Method for Commonsense Reasoning
TLDR: Key to this method is the use of language models, trained on a massive amount of unlabeled data, to score multiple-choice questions posed by commonsense reasoning tests; the approach outperforms previous state-of-the-art methods by a large margin.
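The scoring recipe summarized above can be illustrated in a few lines: compute each candidate completion's log-likelihood under a pretrained LM and pick the highest-scoring one. GPT-2 and the toy Winograd-style question are stand-ins, not the paper's models or data.

```python
# Sketch of scoring multiple-choice candidates with a pretrained LM, in the spirit of
# the summary above. GPT-2 and the toy question are illustrative stand-ins.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def sequence_log_likelihood(text: str) -> float:
    """Total log-probability of the token sequence under the LM."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels=input_ids the model returns the mean cross-entropy over the
        # shifted tokens; multiply by their count to get a total log-likelihood.
        loss = model(ids, labels=ids).loss
    return -loss.item() * (ids.size(1) - 1)

question = "The trophy didn't fit in the suitcase because it was too"
candidates = ["big.", "small."]
scores = {c: sequence_log_likelihood(f"{question} {c}") for c in candidates}
prediction = max(scores, key=scores.get)
```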
Not Enough Data? Deep Learning to the Rescue!
TLDR: This work uses a powerful pre-trained neural network model to artificially synthesize new labeled data for supervised learning and shows that the resulting method, LAMBADA, improves classifiers' performance on a variety of datasets.
Unsupervised Data Augmentation for Consistency Training
TLDR: A new perspective on how to effectively noise unlabeled examples is presented, and it is argued that the quality of noising, specifically the noise produced by advanced data augmentation methods, plays a crucial role in semi-supervised learning.
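A minimal sketch of the consistency-training objective behind the summary above: penalize the divergence between a model's predictions on an unlabeled example and on a noised copy of it. The `model` and `augment_fn` below are placeholders (e.g. back-translation or token-level noising), not UDA's exact augmentation policy or training schedule.

```python
# Consistency-training loss sketch: KL between predictions on clean and augmented copies
# of unlabeled inputs. Model and augmentation function are placeholders.
import torch
import torch.nn.functional as F

def consistency_loss(model, unlabeled_batch, augment_fn):
    """KL(model(original) || model(augmented)) averaged over the batch."""
    with torch.no_grad():
        # Predictions on the clean example serve as a fixed target distribution.
        target = F.softmax(model(unlabeled_batch), dim=-1)
    augmented = augment_fn(unlabeled_batch)
    log_pred = F.log_softmax(model(augmented), dim=-1)
    return F.kl_div(log_pred, target, reduction="batchmean")
```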
HellaSwag: Can a Machine Really Finish Your Sentence?
TLDR: The construction of HellaSwag, a new challenge dataset, and its resulting difficulty shed light on the inner workings of deep pretrained models and suggest a new path forward for NLP research, in which benchmarks co-evolve with the evolving state-of-the-art in an adversarial way, so as to present ever-harder challenges.
WINOGRANDE: An Adversarial Winograd Schema Challenge at Scale
TLDR: This work introduces WinoGrande, a large-scale dataset of 44k problems, inspired by the original WSC design but adjusted to improve both the scale and the hardness of the dataset, and establishes new state-of-the-art results on five related benchmarks.
Generating Natural Adversarial Examples
TLDR: This paper proposes a framework to generate natural and legible adversarial examples that lie on the data manifold by searching in the semantic space of dense and continuous data representations, utilizing recent advances in generative adversarial networks.
Generating Natural Language Adversarial Examples
TLDR: A black-box population-based optimization algorithm is used to generate semantically and syntactically similar adversarial examples that fool well-trained sentiment analysis and textual entailment models with success rates of 97% and 70%, respectively.
Adversarial Example Generation with Syntactically Controlled Paraphrase Networks
TLDR: A combination of automated and human evaluations shows that SCPNs generate paraphrases that follow their target specifications without decreasing paraphrase quality when compared to baseline (uncontrolled) paraphrase systems.