Sequence-level Mixed Sample Data Augmentation

@inproceedings{Guo2020SequencelevelMS,
  title={Sequence-level Mixed Sample Data Augmentation},
  author={Demi Guo and Yoon Kim and Alexander M. Rush},
  booktitle={EMNLP},
  year={2020}
}
Despite their empirical success, neural networks still have difficulty capturing compositional aspects of natural language. This work proposes a simple data augmentation approach to encourage compositional behavior in neural models for sequence-to-sequence problems. Our approach, SeqMix, creates new synthetic examples by softly combining input/output sequences from the training set. We connect this approach to existing techniques such as SwitchOut and word dropout, and show that these… 
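The abstract stops short of the mechanics, but the core idea is mixup applied at the sequence level: a synthetic source/target pair is a convex combination of two training pairs. A minimal NumPy sketch of the soft-mixing idea, assuming token-embedding sequences already padded to a common length (the function name, padding convention, and Beta prior are illustrative assumptions rather than the paper's exact recipe):

import numpy as np

rng = np.random.default_rng(0)

def soft_seqmix(src_a, src_b, tgt_a, tgt_b, alpha=0.1):
    # src_*/tgt_*: (length, dim) token-embedding arrays padded to the same
    # length. A single coefficient drawn from Beta(alpha, alpha) is shared
    # by source and target so the synthetic pair stays internally consistent.
    lam = rng.beta(alpha, alpha)
    mixed_src = lam * src_a + (1.0 - lam) * src_b
    mixed_tgt = lam * tgt_a + (1.0 - lam) * tgt_b
    return mixed_src, mixed_tgt, lam

During training, lam would typically also weight the loss against the two original target sequences, analogous to mixup's soft labels.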

Citations

Improving Compositional Generalization with Latent Structure and Data Augmentation
TLDR
This work presents a more powerful data recombination method using a model called CSL, a generative model with a quasi-synchronous context-free grammar backbone, which results in a model even stronger than a T5-CSL ensemble on two real-world compositional generalization tasks.
GPT3Mix: Leveraging Large-scale Language Models for Text Augmentation
TLDR
This paper proposes a novel data augmentation technique that leverages large-scale language models to generate realistic text samples from a mixture of real samples, and utilizes soft labels predicted by the language models, effectively distilling knowledge from the large-scale language models and creating textual perturbations simultaneously.
FlipDA: Effective and Robust Data Augmentation for Few-Shot Learning
TLDR
This work proposes a novel data augmentation method FlipDA that jointly uses a generative model and a classifier to generate label-flipped data and achieves a good tradeoff between effectiveness and robustness.
To Augment or Not to Augment? A Comparative Study on Text Augmentation Techniques for Low-Resource NLP
TLDR
Three categories of text augmentation methodologies which perform changes on the syntax, token, and character levels are investigated, finding the tested techniques to be effective on morphologically rich languages in general, rather than on analytic languages such as Vietnamese.
Rethinking Data Augmentation for Low-Resource Neural Machine Translation: A Multi-Task Learning Approach
TLDR
This paper proposes a multi-task DA approach in which new sentence pairs are generated with transformations, such as reversing the order of the target sentence, that produce non-fluent target sentences, and shows consistent improvements over the baseline and over DA methods aimed at extending the support of the empirical data distribution.
Substructure Substitution: Structured Data Augmentation for NLP
TLDR
This work studies a family of data augmentation methods, substructure substitution (SUB), that generalizes prior methods, and presents variations of SUB based on text spans or parse trees, introducing structure-aware data augmentation methods for general NLP tasks.
Sequence-to-Sequence Learning with Latent Neural Grammars
TLDR
This work develops a neural parameterization of the grammar which enables parameter sharing over the combinatorial space of derivation rules without the need for manual feature engineering, and applies it to a diagnostic language navigation task and to small-scale machine translation.
Finding needles in a haystack: Sampling Structurally-diverse Training Sets from Synthetic Data for Compositional Generalization
TLDR
This work investigates automatic generation of synthetic utterance-program pairs for improving compositional generalization in semantic parsing, and selects a subset of synthetic examples that are structurally diverse and uses them to improve compositional generalization.
Self-supervised and Supervised Joint Training for Resource-rich Machine Translation
TLDR
A joint training approach, F2-XEnDec, that combines self-supervised and supervised learning to optimize NMT models, achieving substantial improvements over several strong baseline methods and a new state of the art of 46.19 BLEU on English-French when incorporating back-translation.

References

SHOWING 1-10 OF 23 REFERENCES
Good-Enough Compositional Data Augmentation
TLDR
A simple data augmentation protocol aimed at providing a compositional inductive bias in conditional and unconditional sequence models that reduces error rate by as much as 87% on diagnostic tasks from the SCAN dataset and 16% on a semantic parsing task.
SwitchOut: an Efficient Data Augmentation Algorithm for Neural Machine Translation
TLDR
An extremely simple data augmentation strategy for NMT is proposed: randomly replacing words in both the source sentence and the target sentence with other random words from their corresponding vocabularies.
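The replacement rule is simple enough to sketch; a rough NumPy approximation, applied independently to the source and target id sequences (the per-position corruption probability tau is a simplification of the paper's temperature-controlled sampling of how many words to replace):

import numpy as np

rng = np.random.default_rng(0)

def switchout(token_ids, vocab_size, tau=0.1):
    # Replace each position independently with probability tau by a token
    # drawn uniformly from the vocabulary.
    ids = np.asarray(token_ids).copy()
    mask = rng.random(ids.shape[0]) < tau
    ids[mask] = rng.integers(0, vocab_size, size=int(mask.sum()))
    return ids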
Soft Contextual Data Augmentation for Neural Machine Translation
TLDR
This work softly augments a randomly chosen word in a sentence by its contextual mixture of multiple related words, replacing the one-hot representation of a word by a distribution (provided by a language model) over the vocabulary.
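As a rough illustration of the "soft word" idea, the one-hot lookup at the chosen position is replaced by an expectation over the embedding table under the language model's distribution (names and shapes here are assumptions):

import numpy as np

def soft_word_vector(lm_probs, embedding_matrix):
    # lm_probs: (vocab_size,) distribution over the vocabulary predicted by
    # a language model for the chosen position.
    # embedding_matrix: (vocab_size, dim) token-embedding table.
    # The probability-weighted average replaces the original word's embedding.
    return lm_probs @ embedding_matrix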
Latent Alignment and Variational Attention
TLDR
Variational attention networks are considered as alternatives to soft and hard attention for learning latent-variable alignment models, with tighter approximation bounds based on amortized variational inference, and methods for reducing the variance of gradients are proposed to make these approaches computationally feasible.
Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank
TLDR
A Sentiment Treebank is introduced that includes fine-grained sentiment labels for 215,154 phrases in the parse trees of 11,855 sentences and presents new challenges for sentiment compositionality, along with the Recursive Neural Tensor Network.
Generalization without Systematicity: On the Compositional Skills of Sequence-to-Sequence Recurrent Networks
TLDR
This paper introduces the SCAN domain, consisting of a set of simple compositional navigation commands paired with the corresponding action sequences, and tests the zero-shot generalization capabilities of a variety of recurrent neural networks trained on SCAN with sequence-to-sequence methods.
Neural Module Networks
TLDR
A procedure for constructing and learning neural module networks, which compose collections of jointly trained neural "modules" into deep networks for question answering, using these structures to dynamically instantiate modular networks (with reusable components for recognizing dogs, classifying colors, etc.).
MixText: Linguistically-Informed Interpolation of Hidden Space for Semi-Supervised Text Classification
TLDR
By mixing labeled, unlabeled and augmented data, MixText significantly outperformed current pre-trained and fine-tuned models and other state-of-the-art semi-supervised learning methods on several text classification benchmarks.
mixup: Beyond Empirical Risk Minimization
TLDR
This work proposes mixup, a simple learning principle that trains a neural network on convex combinations of pairs of examples and their labels, which improves the generalization of state-of-the-art neural network architectures.
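The mixup rule itself fits in a few lines; a minimal sketch assuming one-hot label vectors:

import numpy as np

rng = np.random.default_rng(0)

def mixup(x1, y1, x2, y2, alpha=0.2):
    # x1/x2: input arrays of identical shape; y1/y2: one-hot label vectors.
    # One lam drawn from Beta(alpha, alpha) interpolates inputs and labels alike.
    lam = rng.beta(alpha, alpha)
    return lam * x1 + (1.0 - lam) * x2, lam * y1 + (1.0 - lam) * y2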
fairseq: A Fast, Extensible Toolkit for Sequence Modeling
TLDR
Fairseq is an open-source sequence modeling toolkit that allows researchers and developers to train custom models for translation, summarization, language modeling, and other text generation tasks and supports distributed training across multiple GPUs and machines.