Majority Voting with Bidirectional Pre-translation For Bitext Retrieval

Alex Jones and D. Wijaya
Obtaining high-quality parallel corpora is of paramount importance for training NMT systems. However, because many language pairs lack adequate gold-standard training data, a popular approach has been to mine so-called “pseudo-parallel” sentences from paired documents in two languages. In this paper, we outline some drawbacks of current methods that rely on an embedding similarity threshold, and propose a heuristic method in their place. Our method involves translating both halves of a paired corpus…
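
To make the threshold-based mining mentioned in the abstract concrete, here is a minimal sketch; it is not the authors' code. It assumes sentence embeddings are already available as plain lists of floats. The function names, the toy vectors, and the 0.8 cutoff are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mine_by_threshold(src_vecs, tgt_vecs, threshold=0.8):
    """Pair each source sentence with its nearest target sentence,
    keeping the pair only if the similarity reaches the threshold.

    Returns a list of (source_index, target_index, similarity) tuples.
    """
    pairs = []
    for i, u in enumerate(src_vecs):
        sims = [cosine(u, v) for v in tgt_vecs]
        j = max(range(len(sims)), key=sims.__getitem__)  # nearest target
        if sims[j] >= threshold:
            pairs.append((i, j, sims[j]))
    return pairs


# Toy 2-d "embeddings" (made up for illustration only).
src = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
tgt = [[0.9, 0.1], [-1.0, 0.2]]

strict = mine_by_threshold(src, tgt, threshold=0.8)
lenient = mine_by_threshold(src, tgt, threshold=0.75)
```

With these toy vectors, the lenient cutoff admits one more pair than the strict one, which hints at how sensitive a fixed threshold can be.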


AugCSE: Contrastive Sentence Embedding with Diverse Augmentations

This work presents AugCSE, a unified framework that uses diverse sets of data augmentations to achieve a better general-purpose sentence embedding model, and shows that diverse augmentations can be tamed to produce a better and more robust sentence representation.

Better Quality Estimation for Low Resource Corpus Mining

This work proposes a combination of multitask training, data augmentation and contrastive learning to achieve better and more robust QE performance. The approach increases accuracy in PCM by more than 0.80, making it on par with state-of-the-art PCM methods that use millions of sentence pairs to train their models.

A Massively Multilingual Analysis of Cross-linguality in Shared Embedding Space

The results of the analyses show that word order agreement and agreement in morphological complexity are two of the strongest linguistic predictors of cross-linguality.

"Wikily" Neural Machine Translation Tailored to Cross-Lingual Tasks

We present a simple but effective approach for leveraging Wikipedia for neural machine translation as well as cross-lingual tasks of image captioning and dependency parsing without using any direct

“Wikily” Supervised Neural Translation Tailored to Cross-Lingual Tasks

It is shown that first sentences and titles of linked Wikipedia pages, as well as cross-lingual image captions, are strong signals for seed parallel data from which to extract bilingual dictionaries and cross-lingual word embeddings for mining parallel text from Wikipedia.



Language-agnostic BERT Sentence Embedding

It is shown that introducing a pre-trained multilingual language model dramatically reduces the amount of parallel training data required to achieve good performance by 80%, and a model that achieves 83.7% bi-text retrieval accuracy over 112 languages on Tatoeba is released.

Cross-lingual Retrieval for Iterative Self-Supervised Training

This work found that cross-lingual alignment can be further improved by training seq2seq models on sentence pairs mined using their own encoder outputs, and developed a new approach, cross-lingual retrieval for iterative self-supervised training (CRISS), in which mining and training are applied iteratively, improving cross-lingual alignment and translation ability at the same time.

CCMatrix: Mining Billions of High-Quality Parallel Sentences on the Web

It is shown that margin-based bitext mining in a multilingual sentence space can be successfully scaled to operate on monolingual corpora of billions of sentences and set a new state of the art for a single system on the WMT’19 test set for English-German/Russian/Chinese.
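
As an illustration of what margin-based scoring can look like, the sketch below implements one common formulation, the ratio margin: the cosine similarity of a candidate pair divided by the average similarity of each sentence to its k nearest neighbours in the other language. Whether this matches the exact variant used in the work summarised above is an assumption. The function names and toy data are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def ratio_margin_scores(src_vecs, tgt_vecs, k=2):
    """Return a matrix of ratio-margin scores.

    score[i][j] = cos(x_i, y_j) / ((avg_knn(x_i) + avg_knn(y_j)) / 2),
    where avg_knn is the mean cosine similarity of a sentence to its
    k nearest neighbours in the other language.
    Assumes the neighbourhood averages are positive.
    """
    sims = [[cosine(u, v) for v in tgt_vecs] for u in src_vecs]
    # Mean similarity of each source sentence to its k nearest targets.
    src_avg = [sum(sorted(row, reverse=True)[:k]) / k for row in sims]
    # Mean similarity of each target sentence to its k nearest sources.
    tgt_avg = [sum(sorted(col, reverse=True)[:k]) / k for col in zip(*sims)]
    return [
        [sims[i][j] / ((src_avg[i] + tgt_avg[j]) / 2) for j in range(len(tgt_vecs))]
        for i in range(len(src_vecs))
    ]


# Toy example (made-up vectors): two sources, two targets.
scores = ratio_margin_scores([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]], k=2)
```

Dividing by the neighbourhood averages means a pair is scored relative to how similar each sentence is to its other close candidates, rather than on raw cosine similarity alone.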

Unsupervised Cross-lingual Representation Learning at Scale

It is shown that pretraining multilingual language models at scale leads to significant performance gains for a wide range of cross-lingual transfer tasks, and the possibility of multilingual modeling without sacrificing per-language performance is demonstrated for the first time.

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

A new language representation model, BERT, is introduced. It is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, and it can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.

Billion-Scale Similarity Search with GPUs

This paper proposes a novel design for k-selection and applies it in different similarity search scenarios, with optimized designs for brute-force, approximate, and compressed-domain search based on product quantization.

Low-Resource Machine Translation for Low-Resource Languages: Leveraging Comparable Data, Code-Switching and Compute Resources

This work proposes a simple and scalable method to improve unsupervised NMT, showing how adding comparable data mined using a bilingual dictionary, along with modest additional compute resources to train the model, can significantly improve its performance.

Extracting Parallel Sentences from Comparable Corpora with STACC Variants

This approach is based on variants of the STACC method, which computes similarity on expanded lexical sets via Jaccard similarity, and applies the weighted variant of the method to all four language pairs of the task, demonstrating the efficiency and portability of the approach.
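
The idea of Jaccard similarity over expanded lexical sets can be illustrated with a simplified sketch. This is not the STACC implementation: the lexicon format, the function names, and the choice to average the two directions are assumptions made for illustration.

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two token collections."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def expand(tokens, lexicon):
    """Replace each token by its translations from a (hypothetical)
    lexicon mapping token -> iterable of translations.
    Tokens missing from the lexicon (e.g. names, numbers) are kept as-is.
    """
    expanded = set()
    for tok in tokens:
        expanded.update(lexicon.get(tok, [tok]))
    return expanded


def expanded_jaccard(src_tokens, tgt_tokens, src2tgt, tgt2src):
    """Average of Jaccard similarity computed in both translation directions."""
    forward = jaccard(expand(src_tokens, src2tgt), tgt_tokens)
    backward = jaccard(expand(tgt_tokens, tgt2src), src_tokens)
    return (forward + backward) / 2


# Toy English-Spanish lexicons (made up for illustration only).
en2es = {"black": ["negro"], "cat": ["gato"]}
es2en = {"negro": ["black"], "gato": ["cat"], "perro": ["dog"]}

match = expanded_jaccard(["black", "cat"], ["gato", "negro"], en2es, es2en)
mismatch = expanded_jaccard(["black", "cat"], ["perro"], en2es, es2en)
```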

Beyond English-Centric Multilingual Machine Translation

This work creates a true Many-to-Many multilingual translation model that can translate directly between any pair of 100 languages and explores how to effectively increase model capacity through a combination of dense scaling and language-specific sparse parameters to create high quality models.

Unsupervised Bitext Mining and Translation via Self-Trained Contextual Embeddings

It is demonstrated that unsupervised bitext mining is an effective way of augmenting MT datasets and complements existing techniques like initializing with pre-trained contextual embeddings.