Generative Data Augmentation for Commonsense Reasoning

@article{Yang2020GenerativeDA,
  title={Generative Data Augmentation for Commonsense Reasoning},
  author={Yiben Yang and Chaitanya Malaviya and Jared Fernandez and Swabha Swayamdipta and Ronan Le Bras and Ji-ping Wang and Chandra Bhagavatula and Yejin Choi and Doug Downey},
  journal={Findings of the Association for Computational Linguistics: EMNLP 2020},
  year={2020}
}
Recent advances in commonsense reasoning depend on large-scale human-annotated training data to achieve peak performance. However, manual curation of training examples is expensive and has been shown to introduce annotation artifacts that neural models can readily exploit and overfit on. We investigate G-DAUG^C, a novel generative data augmentation method that aims to achieve more accurate and robust learning in the low-resource setting. Our approach generates synthetic examples using… 
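The truncated abstract above describes the core loop: use a pretrained language model to generate synthetic training examples for a low-resource commonsense task, then train on the augmented data. Below is a minimal sketch of the generation step using the Hugging Face pipeline API; the model choice, prompt format, and naive length filter are illustrative assumptions, not the G-DAUG^C recipe itself (the paper's selection of synthetic examples is omitted here).

```python
# Minimal sketch of the generation step in generative data augmentation.
# Model choice (gpt2), prompt format, and the length filter are illustrative
# assumptions; they are not the G-DAUG^C recipe itself.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

# Hypothetical seed prompts derived from a few gold training questions.
seed_prompts = [
    "Q: Where would you put a plate after washing it? A:",
    "Q: What might someone do right after waking up? A:",
]

synthetic_examples = []
for prompt in seed_prompts:
    outputs = generator(
        prompt,
        max_new_tokens=20,
        num_return_sequences=3,
        do_sample=True,
        top_p=0.95,
    )
    for out in outputs:
        text = out["generated_text"]
        # Placeholder filter: keep only non-trivial continuations. The paper's
        # own selection of synthetic examples is not reproduced in this sketch.
        if len(text) > len(prompt) + 5:
            synthetic_examples.append(text)

print(f"Kept {len(synthetic_examples)} synthetic examples for augmentation.")
```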

A Survey on Data Augmentation for Text Classification

TLDR
This survey is concerned with data augmentation methods for textual classification and aims to provide a concise and comprehensive overview for researchers and practitioners.

Generate, Annotate, and Learn: NLP with Synthetic Text

TLDR
GAL achieves new state-of-the-art knowledge distillation results for 6-layer transformers on the GLUE leaderboard; the work investigates key components of GAL and presents theoretical and empirical arguments against the use of class-conditional LMs.
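The generate-annotate-learn recipe summarized above hinges on a standard distillation step: a teacher labels synthetic text and a smaller student is trained to match the teacher's soft predictions. A minimal PyTorch sketch of that loss follows; the temperature value and the commented usage are illustrative assumptions rather than GAL's exact configuration.

```python
# Sketch of the distillation objective used when training a student on
# teacher-annotated synthetic text. Temperature and usage are illustrative.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between temperature-softened teacher and student outputs."""
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    return F.kl_div(student_log_probs, teacher_probs,
                    reduction="batchmean") * temperature ** 2

# Hypothetical usage on a batch of synthetic, unlabeled examples:
#   with torch.no_grad():
#       teacher_logits = teacher(**synthetic_batch).logits
#   student_logits = student(**synthetic_batch).logits
#   distillation_loss(student_logits, teacher_logits).backward()
```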

An Empirical Survey of Data Augmentation for Limited Data Learning in NLP

TLDR
An empirical survey of recent progress on data augmentation for NLP in the limited labeled data setting is provided, summarizing the landscape of methods and carrying out experiments on 11 datasets covering topics/news classification, inference tasks, paraphrasing tasks, and single-sentence tasks.

Reweighting Strategy based on Synthetic Data Identification for Sentence Similarity

TLDR
This work proposes a novel approach that trains a classifier to measure the importance of each sentence; the distilled information is then used to train a reliable sentence embedding model that generalizes well and outperforms existing baselines.

Low-Resource Dense Retrieval for Open-Domain Question Answering: A Comprehensive Survey

TLDR
A thorough structured overview of mainstream techniques for low-resource DR, dividing the techniques into three main categories based on their required resources, and highlighting the open issues and pros and cons.

DictBERT: Dictionary Description Knowledge Enhanced Language Model Pre-training via Contrastive Learning

TLDR
DictBERT is proposed, a novel approach that enhances PLMs with dictionary knowledge, which is easier to acquire than a knowledge graph (KG), and can significantly improve typical PLMs.

Building Korean Sign Language Augmentation (KoSLA) Corpus with Data Augmentation Technique

TLDR
The effectiveness of the data augmentation technique and the usefulness of the corpus are verified by performing a translation task between normal sentences and sign language annotations with two tokenizers, showing that the BLEU scores obtained with the KoSLA corpus were significant.

A survey of methods for revealing and overcoming weaknesses of data-driven Natural Language Understanding

TLDR
This structured survey provides an overview of the evolving research area by categorising reported weaknesses in models and datasets and the methods proposed to reveal and alleviate those weaknesses for the English language.

Data Augmentation for Biomedical Factoid Question Answering

TLDR
It is shown that DA can lead to very significant performance gains, even when using large pre-trained Transformers, contributing to a broader discussion of if/when DA benefits large pre-trained models.

True Few-Shot Learning with Prompts—A Real-World Perspective

TLDR
An extensive study of PET, a method that combines textual instructions with example-based finetuning, shows that, if correctly configured, PET performs strongly in true few-shot settings without a dev set, underpinning the belief that learning from instructions will play an important role on the path towards human-like few-shot learning capabilities.

References

SHOWING 1-10 OF 70 REFERENCES

PyHessian: Neural Networks Through the Lens of the Hessian

TLDR
PyHessian, a new scalable framework that enables fast computation of Hessian (i.e., second-order derivative) information for deep neural networks, yields new finer-scale insights, demonstrating that while conventional wisdom is sometimes validated, in other cases it is simply incorrect.
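The framework's core primitive is the Hessian-vector product, which makes spectral quantities (top eigenvalues, trace, eigenvalue density) tractable for large networks. Below is a from-scratch PyTorch sketch of a top-eigenvalue estimate via power iteration; the function name and loop structure are illustrative and do not reproduce PyHessian's actual API.

```python
# Illustrative top-Hessian-eigenvalue estimate via Hessian-vector products
# and power iteration (the idea behind PyHessian; not its API).
import torch

def top_hessian_eigenvalue(loss, params, iters=20):
    """Estimate the largest Hessian eigenvalue of `loss` w.r.t. `params`."""
    # Gradients with create_graph=True so we can differentiate them again.
    grads = torch.autograd.grad(loss, params, create_graph=True)

    # Random unit-norm starting vector, one tensor per parameter.
    vec = [torch.randn_like(p) for p in params]
    norm = torch.sqrt(sum((v ** 2).sum() for v in vec))
    vec = [v / norm for v in vec]

    eigenvalue = 0.0
    for _ in range(iters):
        # Hessian-vector product via double backprop (Pearlmutter's trick).
        dot = sum((g * v).sum() for g, v in zip(grads, vec))
        hv = torch.autograd.grad(dot, params, retain_graph=True)

        # Rayleigh quotient v^T H v (vec is unit norm), then re-normalize.
        eigenvalue = sum((h * v).sum() for h, v in zip(hv, vec)).item()
        norm = torch.sqrt(sum((h ** 2).sum() for h in hv))
        vec = [h / (norm + 1e-12) for h in hv]
    return eigenvalue
```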

Transformers: State-of-the-art Natural Language Processing

  • Association for Computational Linguistics
  • 2020

Is BERT Really Robust? Natural Language Attack on Text Classification and Entailment

TLDR
TextFooler is presented, a general attack framework to generate natural adversarial texts, which outperforms state-of-the-art attacks in terms of success rate and perturbation rate.

CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge

TLDR
This work presents CommonsenseQA: a challenging new dataset for commonsense question answering, which extracts from ConceptNet multiple target concepts that have the same semantic relation to a single source concept.

A large annotated corpus for learning natural language inference

TLDR
The Stanford Natural Language Inference corpus is introduced, a new, freely available collection of labeled sentence pairs, written by humans doing a novel grounded task based on image captioning, which allows a neural network-based model to perform competitively on natural language inference benchmarks for the first time.

CODAH: An adversarially-authored question answering dataset for common sense

  • In Proceedings of the 3rd Workshop on Evaluating Vector Space Representations for NLP
  • 2019

WINOGRANDE: An Adversarial Winograd Schema Challenge at Scale

TLDR
This work introduces WinoGrande, a large-scale dataset of 44k problems, inspired by the original WSC design, but adjusted to improve both the scale and the hardness of the dataset, and establishes new state-of-the-art results on five related benchmarks.

HellaSwag: Can a Machine Really Finish Your Sentence?

TLDR
The construction of HellaSwag, a new challenge dataset, and its resulting difficulty shed light on the inner workings of deep pretrained models, and suggest a new path forward for NLP research, in which benchmarks co-evolve with the evolving state of the art in an adversarial way, so as to present ever-harder challenges.

FreeLB: Enhanced Adversarial Training for Natural Language Understanding

TLDR
A novel adversarial training algorithm is proposed, FreeLB, that promotes higher invariance in the embedding space, by adding adversarial perturbations to word embeddings and minimizing the resultant adversarial risk inside different regions around input samples.
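The perturbations described above live in embedding space rather than in the discrete token space. A simplified single-step sketch of that pattern for a Hugging Face-style classification model follows; FreeLB itself accumulates gradients over several ascent steps inside a norm ball, so the epsilon value and the one-step update here are illustrative assumptions only.

```python
# Simplified embedding-space adversarial training step (one ascent step only;
# FreeLB performs multiple accumulated steps inside a norm-bounded region).
import torch

def adversarial_loss(model, input_embeds, labels, epsilon=1e-2):
    """Perturb input embeddings in the loss-increasing direction, then re-forward."""
    input_embeds = input_embeds.detach().requires_grad_(True)
    clean_loss = model(inputs_embeds=input_embeds, labels=labels).loss

    # Gradient of the loss w.r.t. the embeddings (graph kept for the final backward).
    grad, = torch.autograd.grad(clean_loss, input_embeds, retain_graph=True)

    # Normalized ascent direction, bounded by epsilon per token embedding.
    delta = epsilon * grad / (grad.norm(dim=-1, keepdim=True) + 1e-12)
    adv_loss = model(inputs_embeds=input_embeds + delta, labels=labels).loss

    # Train on both the clean and the adversarial views of the batch.
    return clean_loss + adv_loss
```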

RoBERTa: A Robustly Optimized BERT Pretraining Approach

TLDR
It is found that BERT was significantly undertrained and that, with an improved pretraining recipe, it can match or exceed the performance of every model published after it; the best model achieves state-of-the-art results on GLUE, RACE and SQuAD.
...