Don’t Rule Out Monolingual Speakers: A Method For Crowdsourcing Machine Translation Data

Rajat Bhatnagar, Ananya Ganesh, Katharina Kann
High-performing machine translation (MT) systems can help overcome language barriers while making it possible for everyone to communicate and use language technologies in the language of their choice. However, such systems require large amounts of parallel sentences for training, and translators can be difficult to find and expensive. Here, we present a data collection strategy for MT which, in contrast, is cheap and simple, as it does not require bilingual speakers. Based on the insight that… 

Improving Neural Machine Translation Models with Monolingual Data

This work pairs monolingual training data with automatic back-translations so that it can be treated as additional parallel training data, and obtains substantial improvements on the WMT 15 English<->German task and on the low-resource IWSLT 14 Turkish->English task.

Unsupervised Machine Translation Using Monolingual Corpora Only

This work proposes a model that takes sentences from monolingual corpora in two different languages and maps them into the same latent space and effectively learns to translate without using any labeled data.

Parallel Corpus Filtering via Pre-trained Language Models

A novel approach that filters noisy sentence pairs out of web-crawled corpora with pre-trained language models, using a Generative Pre-training (GPT) language model as a domain filter to balance data domains, and achieves a new state of the art.

Generalized Data Augmentation for Low-Resource Translation

This paper proposes a general framework of data augmentation for low-resource machine translation not only using target-side monolingual data, but also by pivoting through a related high-resource language.

Not Low-Resource Anymore: Aligner Ensembling, Batch Filtering, and New Datasets for Bengali-English Machine Translation

This work builds a customized sentence segmenter for Bengali and proposes two novel methods for parallel corpus creation in low-resource setups: aligner ensembling and batch filtering, which will pave the way for future research on Bengali-English machine translation as well as other low-resource languages.

Practical Comparable Data Collection for Low-Resource Languages via Images

We propose a method of curating high-quality comparable training data for low-resource languages with monolingual annotators. Our method involves using a carefully selected set of images as a pivot…

Unsupervised Parallel Corpus Mining on Web Data

This work presents a pipeline for mining parallel corpora from the Internet in an unsupervised manner; a machine translation system trained on the extracted data performs very close to supervised results.

Data Augmentation for Low-Resource Neural Machine Translation

A novel data augmentation approach that targets low-frequency words by generating new sentence pairs containing rare words in new, synthetically created contexts, improving translation quality in simulated low-resource settings.

Bleu: a Method for Automatic Evaluation of Machine Translation

This work proposes a method of automatic machine translation evaluation that is quick, inexpensive, and language-independent, that correlates highly with human evaluation, and that has little marginal cost per run.

Dual Subtitles as Parallel Corpora

This paper presents a simple heuristic to detect and extract dual subtitles and shows that more than 20 million sentence pairs can be extracted for the Mandarin-English language pair.