GenAug: Data Augmentation for Finetuning Text Generators

Authors: Steven Y. Feng, Varun Gangal, Dongyeop Kang, T. Mitamura, E. Hovy
In this paper, we investigate data augmentation for text generation, which we call GenAug. Text generation and language modeling are important tasks within natural language processing, and are especially challenging for low-data regimes. We propose and evaluate various augmentation methods, including some that incorporate external knowledge, for finetuning GPT-2 on a subset of Yelp Reviews. We also examine the relationship between the amount of augmentation and the quality of the generated text…
SAPPHIRE: Approaches for Enhanced Concept-to-Text Generation
We motivate and propose a suite of simple but effective improvements for concept-to-text generation called SAPPHIRE: Set Augmentation and Post-hoc PHrase Infilling and REcombination. We demonstrate…
An Empirical Survey of Data Augmentation for Limited Data Learning in NLP
An empirical survey of recent progress on data augmentation for NLP in the limited labeled data setting is provided, summarizing the landscape of methods and carrying out experiments on 11 datasets covering topics/news classification, inference tasks, paraphrasing tasks, and single-sentence tasks.
Improving Automated Evaluation of Open Domain Dialog via Diverse Reference Augmentation
This work proposes a novel technique for automatically expanding a human-generated reference into a set of candidate references: it fetches plausible references from knowledge sources and adapts them so that they are more fluent in the context of the dialog instance in question.
A Survey on Data Augmentation for Text Classification
This survey is concerned with data augmentation methods for textual classification and aims to achieve a concise and comprehensive overview for researchers and practitioners.
A Survey of Data Augmentation Approaches for NLP
This paper introduces and motivates data augmentation for NLP, then discusses major methodologically representative approaches and highlights techniques that are used for popular NLP applications and tasks.
Generating Fake Cyber Threat Intelligence Using Transformer-Based Models
It is shown that, given an initial prompt sentence, a fine-tuned public language model like GPT-2 can generate plausible CTI text capable of corrupting cyber-defense systems, and that professional threat hunters were equally likely to consider the fake generated CTI as true.
NAREOR: The Narrative Reordering Problem
This work presents a dataset, NAREORC, with over 1000 human rewritings of stories within ROCStories in non-linear orders, and proposes novel initial task-specific training methods and evaluation metrics.


Do Massively Pretrained Language Models Make Better Storytellers?
It is found that although GPT2-117 conditions more strongly on context, is more sensitive to ordering of events, and uses more unusual words, it is just as likely to produce repetitive and under-diverse text when using likelihood-maximizing decoding algorithms.
Contextual Augmentation: Data Augmentation by Words with Paradigmatic Relations
This work retrofits a language model with a label-conditional architecture, which allows the model to augment sentences without breaking label compatibility and improves classifiers based on convolutional or recurrent neural networks.
The Curious Case of Neural Text Degeneration
By sampling text from the dynamic nucleus of the probability distribution, which allows for diversity while effectively truncating the less reliable tail of the distribution, the resulting text better demonstrates the quality of human text, yielding enhanced diversity without sacrificing fluency and coherence.
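The truncation idea in this summary can be illustrated with a minimal sketch. It assumes a next-token distribution given as a plain list of probabilities; the function names `nucleus_filter` and `nucleus_sample` and the threshold name `top_p` are illustrative choices here, not identifiers taken from the paper.

```python
import random


def nucleus_filter(probs, top_p):
    """Keep the smallest set of most-probable tokens whose cumulative
    probability reaches top_p, and renormalize over that set."""
    # Token indices sorted from most to least probable.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break  # everything after this point is the truncated tail
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}


def nucleus_sample(probs, top_p, rng):
    """Draw one token index from the renormalized nucleus."""
    nucleus = nucleus_filter(probs, top_p)
    indices = list(nucleus)
    weights = [nucleus[i] for i in indices]
    return rng.choices(indices, weights=weights, k=1)[0]


# Toy distribution over four tokens (made up for illustration).
toy_probs = [0.5, 0.3, 0.15, 0.05]
toy_nucleus = nucleus_filter(toy_probs, top_p=0.75)
toy_token = nucleus_sample(toy_probs, top_p=0.75, rng=random.Random(0))
```

With this toy distribution and `top_p=0.75`, only the two most probable tokens survive, so the two low-probability tail tokens can never be sampled.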
EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks
EDA consists of four simple but powerful operations: synonym replacement, random insertion, random swap, and random deletion. It improves performance for both convolutional and recurrent neural networks.
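As a rough illustration of the four operations named above, here is a minimal sketch on tokenized sentences. It is not the authors' implementation: the `SYNONYMS` table is a hypothetical toy stand-in for a real thesaurus, and the function names and the parameters `n` (number of edits) and `p` (deletion probability) are assumptions made for this example.

```python
import random

# Hypothetical toy synonym table, used only for illustration.
SYNONYMS = {
    "good": ["great", "fine"],
    "food": ["meal", "cuisine"],
    "quick": ["fast", "speedy"],
}


def synonym_replacement(words, n, rng):
    """Replace up to n words that have an entry in SYNONYMS."""
    out = list(words)
    candidates = [i for i, w in enumerate(out) if w in SYNONYMS]
    rng.shuffle(candidates)
    for i in candidates[:n]:
        out[i] = rng.choice(SYNONYMS[out[i]])
    return out


def random_insertion(words, n, rng):
    """Insert a synonym of a random known word at a random position, n times."""
    out = list(words)
    for _ in range(n):
        candidates = [w for w in out if w in SYNONYMS]
        if not candidates:
            break  # nothing in the sentence has a known synonym
        synonym = rng.choice(SYNONYMS[rng.choice(candidates)])
        out.insert(rng.randint(0, len(out)), synonym)
    return out


def random_swap(words, n, rng):
    """Swap two randomly chosen positions, n times."""
    out = list(words)
    if len(out) < 2:
        return out
    for _ in range(n):
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out


def random_deletion(words, p, rng):
    """Drop each word with probability p, always keeping at least one word."""
    if not words:
        return []
    kept = [w for w in words if rng.random() >= p]
    return kept if kept else [rng.choice(words)]


rng = random.Random(0)
sentence = "the food was good and the service was quick".split()
replaced = synonym_replacement(sentence, 2, rng)
inserted = random_insertion(sentence, 2, rng)
swapped = random_swap(sentence, 2, rng)
deleted = random_deletion(sentence, 0.2, rng)
```

Each function returns a new token list and leaves the input untouched, so several augmented variants can be generated from the same source sentence.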
Good-Enough Compositional Data Augmentation
A simple data augmentation protocol aimed at providing a compositional inductive bias in conditional and unconditional sequence models that reduces error rate by as much as 87% on diagnostic tasks from the SCAN dataset and 16% on a semantic parsing task.
Revisiting Recurrent Networks for Paraphrastic Sentence Embeddings
This work considers the problem of learning general-purpose, paraphrastic sentence embeddings, revisiting the setting of Wieting et al. (2016b), and presents several developments that together produce the opposite conclusion.
Language Models are Unsupervised Multitask Learners
It is demonstrated that language models begin to learn these tasks without any explicit supervision when trained on a new dataset of millions of webpages called WebText, suggesting a promising path towards building language processing systems which learn to perform tasks from their naturally occurring demonstrations.
BERTScore: Evaluating Text Generation with BERT
This work proposes BERTScore, an automatic evaluation metric for text generation that correlates better with human judgments and provides stronger model selection performance than existing metrics.
A Diversity-Promoting Objective Function for Neural Conversation Models
This work proposes using Maximum Mutual Information (MMI) as the objective function in neural models, and demonstrates that the proposed MMI models produce more diverse, interesting, and appropriate responses, yielding substantive gains in BLEU scores on two conversational datasets and in human evaluations.
Synthetic and Natural Noise Both Break Neural Machine Translation
It is found that a model based on a character convolutional neural network is able to simultaneously learn representations robust to multiple kinds of noise, including structure-invariant word representations and robust training on noisy texts.