Corpus ID: 244346011

To Augment or Not to Augment? A Comparative Study on Text Augmentation Techniques for Low-Resource NLP

  title={To Augment or Not to Augment? A Comparative Study on Text Augmentation Techniques for Low-Resource NLP},
  author={G{\"o}zde G{\"u}l {\c{S}}ahin},
Data-hungry deep neural networks have established themselves as the de facto standard for many NLP tasks, including the traditional sequence tagging ones. Despite their state-of-the-art performance on high-resource languages, they still fall behind their statistical counterparts in low-resource scenarios. One methodology to counteract this problem is text augmentation, i.e., generating new synthetic training data points from existing data. Although NLP has recently witnessed a load of…


Data Augmentation via Dependency Tree Morphing for Low-Resource Languages
It is shown that crop and rotate provide improvements over the models trained with non-augmented data for the majority of the languages, especially for languages with rich case marking systems.
DAGA: Data Augmentation with a Generation Approach for Low-resource Tagging Tasks
To generate high-quality synthetic data for low-resource tagging tasks, this work proposes a novel augmentation method using language models trained on linearized labeled sentences, which consistently outperforms the baselines.
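The linearization step this abstract refers to can be sketched as follows. This is a minimal illustration under the assumption of BIO-style tags interleaved before their tokens; the function name `linearize` and the exact tag placement are assumptions, not the paper's actual code:

```python
def linearize(tokens, tags):
    """Flatten a tagged sentence into a single token stream by inserting
    each non-O tag immediately before the word it labels, so a plain
    language model can be trained on (and later generate) labeled data."""
    out = []
    for tok, tag in zip(tokens, tags):
        if tag != "O":
            out.append(tag)
        out.append(tok)
    return out

linearized = linearize(
    ["John", "went", "to", "Paris"],
    ["B-PER", "O", "O", "B-LOC"],
)
# → ["B-PER", "John", "went", "to", "B-LOC", "Paris"]
```

A language model trained on such sequences can then sample new token streams, which are de-linearized back into (sentence, tag) pairs to serve as synthetic training data.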
SwitchOut: an Efficient Data Augmentation Algorithm for Neural Machine Translation
An extremely simple data augmentation strategy for NMT is proposed: randomly replacing words in both the source sentence and the target sentence with other random words from their corresponding vocabularies.
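The replacement idea can be sketched in a few lines. This is a simplification: each token is swapped independently with probability `p`, whereas SwitchOut itself samples the number of swaps from a temperature-controlled distribution. The helper name and parameters are assumptions for illustration:

```python
import random

def random_word_replacement(tokens, vocab, p=0.1, rng=None):
    """Independently replace each token with a uniformly sampled
    vocabulary word with probability p (simplified SwitchOut-style
    augmentation; apply to source and target sentences separately)."""
    rng = rng or random.Random(0)
    return [rng.choice(vocab) if rng.random() < p else tok for tok in tokens]

src = "the cat sat on the mat".split()
vocab = ["dog", "ran", "under", "table", "a"]
augmented = random_word_replacement(src, vocab, p=0.3)
```

The appeal of the method is that it needs no external resources: the noise distribution is defined entirely by the training vocabularies.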
Do Not Have Enough Data? Deep Learning to the Rescue!
This work uses a powerful pre-trained neural network model to artificially synthesize new labeled data for supervised learning and shows that LAMBADA improves classifiers' performance on a variety of datasets.
GPT3Mix: Leveraging Large-scale Language Models for Text Augmentation
This paper proposes a novel data augmentation technique that leverages large-scale language models to generate realistic text samples from a mixture of real samples, and utilizes soft-labels predicted by the language models, effectively distilling knowledge from the large-scale language models and creating textual perturbations simultaneously.
GenAug: Data Augmentation for Finetuning Text Generators
This paper proposes and evaluates various augmentation methods, including some that incorporate external knowledge, for finetuning GPT-2 on a subset of Yelp Reviews, and examines the relationship between the amount of augmentation and the quality of the generated text.
Soft Contextual Data Augmentation for Neural Machine Translation
This work softly augments a randomly chosen word in a sentence by its contextual mixture of multiple related words, replacing the one-hot representation of a word by a distribution (provided by a language model) over the vocabulary.
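The core operation, replacing a word's one-hot representation with a probability-weighted mixture of embeddings, reduces to a single matrix product. A minimal sketch, assuming a toy vocabulary-sized distribution in place of an actual language model's predictions:

```python
import numpy as np

def soft_word_embedding(probs, embedding_matrix):
    """Return the expected embedding under a distribution over the
    vocabulary: sum_j P(w_j | context) * E[w_j]. With a one-hot `probs`
    this recovers the ordinary hard embedding lookup."""
    return probs @ embedding_matrix

rng = np.random.default_rng(0)
V, d = 5, 4                       # toy vocabulary size and embedding dim
E = rng.normal(size=(V, d))       # stand-in embedding matrix
probs = np.array([0.7, 0.1, 0.1, 0.05, 0.05])  # stand-in LM distribution
soft = soft_word_embedding(probs, E)
```

In the paper's setting the distribution comes from a language model conditioned on the sentence context, so the soft word stays semantically close to plausible replacements.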
Augmenting Data with Mixup for Sentence Classification: An Empirical Study
Two strategies for the adaptation of Mixup to sentence classification are proposed: one performs interpolation on word embeddings and another on sentence embeddings, and both serve as an effective, domain-independent data augmentation approach for sentence classification.
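The interpolation itself is the same in both variants: a convex combination of two examples and of their labels, with the weight drawn from a Beta distribution. A minimal sketch with toy sentence embeddings and one-hot labels (the function name and `alpha` default are assumptions):

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2, rng=None):
    """Mix two (embedding, label) pairs with a Beta(alpha, alpha)-sampled
    weight lam: returns (lam*x1 + (1-lam)*x2, lam*y1 + (1-lam)*y2)."""
    rng = rng or np.random.default_rng(0)
    lam = rng.beta(alpha, alpha)
    return lam * x1 + (1.0 - lam) * x2, lam * y1 + (1.0 - lam) * y2

# Toy sentence embeddings and one-hot labels for two training examples.
e1, e2 = np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])
y1, y2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
x_mix, y_mix = mixup(e1, y1, e2, y2)
```

Applying this at the word-embedding level mixes aligned token vectors position by position, while the sentence-embedding variant mixes the pooled representations; the training loss is then computed against the soft mixed label.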
An Empirical Survey of Data Augmentation for Limited Data Learning in NLP
An empirical survey of recent progress on data augmentation for NLP in the limited labeled data setting is provided, summarizing the landscape of methods and carrying out experiments on 11 datasets covering topics/news classification, inference tasks, paraphrasing tasks, and single-sentence tasks.
XLDA: Cross-Lingual Data Augmentation for Natural Language Inference and Question Answering
While natural language processing systems often focus on a single language, multilingual transfer learning has the potential to improve performance, especially for low-resource languages. We…