Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond

  • Mikel Artetxe, Holger Schwenk
  • Published 26 December 2018
  • Computer Science, Linguistics
  • Transactions of the Association for Computational Linguistics
Abstract: We introduce an architecture to learn joint multilingual sentence representations for 93 languages, belonging to more than 30 different families and written in 28 different scripts. Our system uses a single BiLSTM encoder with a shared byte-pair encoding vocabulary for all languages, which is coupled with an auxiliary decoder and trained on publicly available parallel corpora. This enables us to learn a classifier on top of the resulting embeddings using English annotated data only…
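The zero-shot transfer idea in the abstract can be illustrated with a minimal sketch. It assumes a shared embedding space in which translations land close together; the vectors in `TOY_EMBEDDINGS`, the `embed` lookup, and the nearest-centroid classifier are all hypothetical stand-ins chosen for illustration, not the paper's encoder or classifier.

```python
import math

# Hypothetical toy vectors standing in for the output of a shared multilingual
# encoder. They are hand-written for illustration, not produced by any model.
TOY_EMBEDDINGS = {
    "the film was wonderful": [0.9, 0.1, 0.2],
    "a truly great movie": [0.8, 0.2, 0.1],
    "the film was terrible": [0.1, 0.9, 0.2],
    "an awful, boring movie": [0.2, 0.8, 0.1],
    "un film magnifique": [0.85, 0.15, 0.25],        # French, positive
    "der Film war schrecklich": [0.15, 0.85, 0.15],  # German, negative
}


def embed(sentence):
    """Look up a toy embedding; a real system would run the encoder here."""
    return TOY_EMBEDDINGS[sentence]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def fit_centroids(labelled):
    """Average the embeddings of each label (a nearest-centroid classifier)."""
    sums, counts = {}, {}
    for sentence, label in labelled:
        vec = embed(sentence)
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, x in enumerate(vec):
            acc[i] += x
        counts[label] = counts.get(label, 0) + 1
    return {label: [x / counts[label] for x in acc] for label, acc in sums.items()}


def predict(centroids, sentence):
    """Return the label whose centroid is most similar to the sentence."""
    vec = embed(sentence)
    return max(centroids, key=lambda label: cosine(centroids[label], vec))


# Train on English annotated data only.
english_train = [
    ("the film was wonderful", "positive"),
    ("a truly great movie", "positive"),
    ("the film was terrible", "negative"),
    ("an awful, boring movie", "negative"),
]
centroids = fit_centroids(english_train)

# Apply the same classifier, unchanged, to sentences in other languages.
for sentence in ("un film magnifique", "der Film war schrecklich"):
    print(sentence, "->", predict(centroids, sentence))
```

The classifier never sees French or German labels; any transfer comes entirely from the assumption that the encoder maps all languages into one space.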
Learning Contextualised Cross-lingual Word Embeddings for Extremely Low-Resource Languages Using Parallel Corpora
The model outperforms existing methods on bilingual lexicon induction and word alignment tasks and demonstrates that an encoder-decoder translation model is beneficial for learning cross-lingual representations, even in extremely low-resource scenarios.
Learning Contextualised Cross-lingual Word Embeddings and Alignments for Extremely Low-Resource Languages Using Parallel Corpora
This work proposes a new approach for learning contextualised cross-lingual word embeddings from a small parallel corpus, using an LSTM encoder-decoder model that simultaneously translates and reconstructs an input sentence.
Robust Cross-lingual Embeddings from Parallel Sentences
This work proposes a bilingual extension of the CBOW method which leverages sentence-aligned corpora to obtain robust cross-lingual word and sentence representations, and significantly improves cross-lingual sentence retrieval performance over all other approaches while maintaining parity with the current state-of-the-art methods on word translation.
Explicit Alignment Objectives for Multilingual Bidirectional Encoders
A new method for learning multilingual encoders, AMBER (Aligned Multilingual Bidirectional EncodeR), trained on additional parallel data using two explicit alignment objectives that align the multilingual representations at different granularities is presented.
Evaluating Multilingual Text Encoders for Unsupervised Cross-Lingual Retrieval
This work presents a systematic empirical study of the suitability of state-of-the-art multilingual encoders for cross-lingual document and sentence retrieval tasks across a large number of language pairs, and indicates that for unsupervised document-level CLIR – a setup in which there are no relevance judgments for task-specific fine-tuning – the pretrained encoders fail to significantly outperform models based on CLWEs.
On cross-lingual retrieval with multilingual text encoders
The results indicate that for unsupervised document-level CLIR, pretrained multilingual encoders on average fail to significantly outperform earlier models based on CLWEs, and point to “monolingual overfitting” of retrieval models trained on monolingual (English) data, even if they are based on multilingual transformers.
On Learning Universal Representations Across Languages
Hierarchical Contrastive Learning (HiCTL) is proposed to learn universal representations for parallel sentences distributed in one or multiple languages and distinguish the semantically-related words from a shared cross-lingual vocabulary for each sentence.
Unsupervised Interlingual Semantic Representations from Sentence Embeddings for Zero-Shot Cross-Lingual Transfer
This work presents a novel architecture for training interlingual semantic representations on top of sentence embeddings in a completely unsupervised manner, and demonstrates its effectiveness in zero-shot cross-lingual transfer on a natural language inference task.
A Comparison of Architectures and Pretraining Methods for Contextualized Multilingual Word Embeddings
This work presents a comprehensive comparison of state-of-the-art multilingual word and sentence encoders on the tasks of named entity recognition (NER) and part-of-speech (POS) tagging, and proposes a new method for creating multilingual contextualized word embeddings that allows for better knowledge sharing across languages in a joint training setting.
Bitext Mining Using Distilled Sentence Representations for Low-Resource Languages
A new teacher-student training scheme is introduced which combines supervised and self-supervised training, allowing encoders to take advantage of monolingual training data, which is valuable in the low-resource setting.


XNLI: Evaluating Cross-lingual Sentence Representations
This work constructs an evaluation set for XLU by extending the development and test sets of the Multi-Genre Natural Language Inference Corpus to 14 languages, including low-resource languages such as Swahili and Urdu, and finds that XNLI represents a practical and challenging evaluation suite and that directly translating the test data yields the best performance among available baselines.
An Empirical Analysis of NMT-Derived Interlingual Embeddings and Their Use in Parallel Sentence Identification
This work systematically studies neural machine translation context vectors, i.e., the output of the encoder, and their power as an interlingua representation of a sentence, and assesses their quality and effectiveness by measuring similarities across translations, as well as across semantically related and semantically unrelated sentence pairs.
Multilingual Seq2seq Training with Similarity Loss for Cross-Lingual Document Classification
This framework introduces a simple method of adding a loss to the learning objective which penalizes distance between representations of bilingually aligned sentences, and finds the similarity loss significantly improves performance on both cross-lingual transfer and document classification.
Word Translation Without Parallel Data
It is shown that a bilingual dictionary can be built between two languages without using any parallel corpora, by aligning monolingual word embedding spaces in an unsupervised way.
Google’s Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation
This work proposes a simple solution to use a single Neural Machine Translation (NMT) model to translate between multiple languages using a shared wordpiece vocabulary, and introduces an artificial token at the beginning of the input sentence to specify the required target language.
A Corpus for Multilingual Document Classification in Eight Languages
A new subset of the Reuters corpus with balanced class priors for eight languages is proposed, adding Italian, Russian, Japanese and Chinese; strong baselines are provided for all language transfer directions using multilingual word and sentence embeddings, respectively.
Phrase-Based & Neural Unsupervised Machine Translation
This work investigates how to learn to translate when having access to only large monolingual corpora in each language, and proposes two model variants, a neural and a phrase-based model, which are significantly better than methods from the literature, while being simpler and having fewer hyper-parameters.
Cross-lingual Language Model Pretraining
This work proposes two methods to learn cross-lingual language models (XLMs): one unsupervised that only relies on monolingual data, and one supervised that leverages parallel data with a new cross-lingual language model objective.
Parallel Sentence Extraction from Comparable Corpora Using Multilingual Sentence Embeddings
This system identifies parallel sentence pairs in French-English corpora by following a hybrid approach pairing multilingual sentence-level embeddings, neural machine translation, and supervised classification.
Improving Neural Machine Translation Models with Monolingual Data
This work pairs monolingual training data with automatic back-translations and treats the result as additional parallel training data, obtaining substantial improvements on the WMT 15 English↔German task and on the low-resource IWSLT 14 Turkish→English task.