Bilingual alignment transfers to multilingual alignment for unsupervised parallel text mining

Chih-chan Tien and Shane Steinert-Threlkeld
This work presents methods for learning cross-lingual sentence representations from paired or unpaired bilingual texts. We hypothesize that the cross-lingual alignment strategy is transferable: a model trained to align only two languages can produce representations that are more aligned across many languages. We therefore introduce dual-pivot transfer: training on one language pair and evaluating on other pairs. To test this hypothesis, we design unsupervised models trained on unpaired sentences and…
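To make the mining task concrete: once a model embeds sentences from two languages into a shared space, parallel pairs can be extracted by nearest-neighbor search over those embeddings. The sketch below is a minimal illustration of that idea using plain cosine similarity; the function name `mine_parallel_pairs` and the fixed threshold are our own illustrative choices, not the paper's actual models or scoring.

```python
import numpy as np

def mine_parallel_pairs(src_emb, tgt_emb, threshold=0.5):
    """Toy parallel-text mining: match each source sentence to its nearest
    target sentence by cosine similarity of their embeddings.
    (Illustrative only; practical systems use stronger scoring, e.g.
    margin-based ratios over k nearest neighbors.)"""
    # L2-normalize rows so dot products equal cosine similarities.
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    sims = src @ tgt.T                # pairwise cosine-similarity matrix
    best = sims.argmax(axis=1)       # nearest target index per source sentence
    # Keep only matches above the (assumed) similarity threshold.
    return [(i, int(j), float(sims[i, j]))
            for i, j in enumerate(best) if sims[i, j] >= threshold]
```

For example, with two source and two target embeddings where the true alignment is crossed, the function returns the crossed index pairs together with their similarity scores.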

Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond
An architecture that learns joint multilingual sentence representations for 93 languages, belonging to more than 30 different families and written in 28 different scripts, using a single BiLSTM encoder with a shared byte-pair-encoding vocabulary for all languages, coupled with an auxiliary decoder and trained on publicly available parallel corpora.
Unsupervised Cross-lingual Transfer of Word Embedding Spaces
Cross-lingual transfer of word embeddings aims to establish semantic mappings among words in different languages by learning transformation functions over the corresponding word embedding spaces.
Word Translation Without Parallel Data
It is shown that a bilingual dictionary can be built between two languages without using any parallel corpora, by aligning monolingual word embedding spaces in an unsupervised way.
Unsupervised Multilingual Sentence Embeddings for Parallel Corpus Mining
A novel unsupervised method to derive multilingual sentence embeddings relying only on monolingual data is proposed and it is observed that a single synthetic bilingual corpus is able to improve results for other language pairs.
Emerging Cross-lingual Structure in Pretrained Language Models
It is shown that transfer is possible even when there is no shared vocabulary across the monolingual corpora and also when the text comes from very different domains, and it is strongly suggested that, much like for non-contextual word embeddings, there are universal latent symmetries in the learned embedding spaces.
Unsupervised Bitext Mining and Translation via Self-Trained Contextual Embeddings
It is demonstrated that unsupervised bitext mining is an effective way of augmenting MT datasets and complements existing techniques like initializing with pre-trained contextual embeddings.
Unsupervised Machine Translation Using Monolingual Corpora Only
This work proposes a model that takes sentences from monolingual corpora in two different languages and maps them into the same latent space and effectively learns to translate without using any labeled data.
Making Monolingual Sentence Embeddings Multilingual Using Knowledge Distillation
An easy and efficient method to extend existing sentence embedding models to new languages by using the original (monolingual) model to generate sentence embeddings for the source language and then training a new system on translated sentences to mimic the original model.
Cross-lingual Language Model Pretraining
This work proposes two methods to learn cross-lingual language models (XLMs): one unsupervised, relying only on monolingual data, and one supervised, leveraging parallel data with a new cross-lingual language model objective.
Unsupervised Parallel Sentence Extraction from Comparable Corpora
This work presents a simple approach relying on bilingual word embeddings trained in an unsupervised fashion, incorporates orthographic similarity to handle words with similar surface forms, and proposes a dynamic threshold method to decide whether a candidate sentence pair is parallel.