Unsupervised Cross-lingual Representation Learning at Scale

Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, Veselin Stoyanov. Annual Meeting of the Association for Computational Linguistics.
This paper shows that pretraining multilingual language models at scale leads to significant performance gains for a wide range of cross-lingual transfer tasks. We train a Transformer-based masked language model on one hundred languages, using more than two terabytes of filtered CommonCrawl data. Our model, dubbed XLM-R, significantly outperforms multilingual BERT (mBERT) on a variety of cross-lingual benchmarks, including +14.6% average accuracy on XNLI, +13% average F1 score on MLQA, and +2.4… 
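The masked-language-modelling objective behind XLM-R can be sketched in a few lines: predict each masked token from its context and average the cross-entropy over the masked positions only. This is a minimal, framework-free illustration with our own function names, not the paper's code:

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of vocabulary scores."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def mlm_loss(logits, labels, masked):
    """Average negative log-likelihood of the original token at masked
    positions only -- the BERT/XLM-R masked-language-modelling objective.

    logits: per-position lists of vocabulary scores
    labels: original token ids, one per position
    masked: indices of the positions that were masked out
    """
    nll = [-log_softmax(logits[i])[labels[i]] for i in masked]
    return sum(nll) / len(nll)
```

With uniform scores over a vocabulary of size V, the loss is exactly log V, a handy sanity check when wiring up an MLM pipeline.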

On Learning Universal Representations Across Languages

Hierarchical Contrastive Learning (HiCTL) is proposed to learn universal representations for parallel sentences distributed across one or multiple languages, and to distinguish semantically related words from a shared cross-lingual vocabulary for each sentence.
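A sentence-level contrastive objective of the kind HiCTL builds on can be illustrated with an InfoNCE-style loss over a batch of parallel sentences: each source embedding should score its own translation above every other target in the batch. This sketch uses toy list-based embeddings and our own function names; HiCTL's actual hierarchical formulation also contrasts at the word level, which is omitted here:

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def info_nce(src, tgt, temperature=0.1):
    """Batch contrastive loss: for each source sentence, a softmax over
    similarities to all targets, with the aligned pair as the label."""
    loss = 0.0
    for i, s in enumerate(src):
        sims = [cosine(s, t) / temperature for t in tgt]
        m = max(sims)
        lse = m + math.log(sum(math.exp(x - m) for x in sims))
        loss += -(sims[i] - lse)  # cross-entropy with target index i
    return loss / len(src)
```

Aligned pairs should yield a much lower loss than a shuffled batch, which is the signal the encoder is trained to sharpen.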

Cross-Lingual Retrieval Augmented Prompt for Low-Resource Languages

A robustness analysis suggests that PARC has the potential to achieve even stronger performance with more powerful MPLMs, and finds a positive correlation between cross-lingual transfer performance on one side and, on the other, both the similarity between the high- and low-resource languages and the amount of low-resource pretraining data.
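The retrieval step in a PARC-style pipeline can be illustrated as nearest-neighbour search over sentence embeddings: the low-resource query pulls in its most similar labeled high-resource examples, which would then be prepended to the prompt. All names below are illustrative, not the paper's API:

```python
import math

def retrieve_examples(query_emb, pool_embs, pool_texts, k=1):
    """Return the k high-resource examples whose embeddings are most
    cosine-similar to the low-resource query embedding."""
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu = math.sqrt(sum(a * a for a in u))
        nv = math.sqrt(sum(b * b for b in v))
        return dot / (nu * nv)
    ranked = sorted(range(len(pool_embs)),
                    key=lambda i: cos(query_emb, pool_embs[i]),
                    reverse=True)
    return [pool_texts[i] for i in ranked[:k]]
```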

Model Selection for Cross-lingual Transfer

It is shown that it is possible to select consistently better models when small amounts of annotated data are available in auxiliary pivot languages, and a machine learning approach to model selection is proposed that uses the fine-tuned model’s own internal representations to predict its cross-lingual capabilities.

Evaluating Multilingual Text Encoders for Unsupervised Cross-Lingual Retrieval

This work presents a systematic empirical study of the suitability of state-of-the-art multilingual encoders for cross-lingual document and sentence retrieval across a large number of language pairs, and indicates that for unsupervised document-level CLIR – a setup in which there are no relevance judgments for task-specific fine-tuning – the pretrained encoders fail to significantly outperform models based on CLWEs.

On cross-lingual retrieval with multilingual text encoders

The results indicate that for unsupervised document-level CLIR, pretrained multilingual encoders on average fail to significantly outperform earlier models based on CLWEs, and point to “monolingual overfitting” of retrieval models trained on monolingual (English) data, even if they are based on multilingual transformers.

Multi-level Distillation of Semantic Knowledge for Pre-training Multilingual Language Model

Multi-level Multilingual Knowledge Distillation (MMKD) is proposed, a novel method for improving multilingual language models that employs a teacher-student framework to adopt rich semantic representation knowledge in English BERT.
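The core of a logit-level distillation objective of the kind MMKD builds on is the temperature-softened KL divergence between teacher and student outputs (Hinton-style, scaled by T²). MMKD additionally distills at the word, sentence, and structure levels, which this sketch omits:

```python
import math

def soft_probs(logits, T):
    """Temperature-softened softmax: softmax(logits / T), computed stably."""
    z = [x / T for x in logits]
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

def distill_loss(student_logits, teacher_logits, T=2.0):
    """KL(teacher || student) on softened distributions, scaled by T*T so
    gradient magnitudes stay comparable across temperatures."""
    p = soft_probs(teacher_logits, T)
    q = soft_probs(student_logits, T)
    return T * T * sum(pi * (math.log(pi) - math.log(qi))
                       for pi, qi in zip(p, q))
```

Identical logits give zero loss; any mismatch gives a strictly positive loss, which is what the student is trained to drive down.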

Model Selection for Cross-Lingual Transfer using a Learned Scoring Function

A meta-learning approach to model selection that uses the fine-tuned model's own internal representations to predict its cross-lingual capabilities is proposed, finding that this approach consistently selects better models than English validation data across five languages and five well-studied NLP tasks.
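The idea of a learned scoring function can be sketched as fitting a simple regressor from cheap checkpoint features to observed dev performance, then picking the checkpoint with the highest predicted score. A single scalar feature keeps the sketch stdlib-only; the paper derives richer features from the fine-tuned model's internal representations:

```python
def fit_linear_scorer(features, dev_scores):
    """Least-squares fit of y ~ a*x + b on (feature, dev score) pairs from
    checkpoints whose target-language performance is known."""
    n = len(features)
    mx = sum(features) / n
    my = sum(dev_scores) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(features, dev_scores))
    var = sum((x - mx) ** 2 for x in features)
    a = cov / var
    return a, my - a * mx

def select_checkpoint(features, scorer):
    """Pick the candidate checkpoint with the highest predicted score."""
    a, b = scorer
    preds = [a * x + b for x in features]
    return preds.index(max(preds))
```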

Subword Evenness (SuE) as a Predictor of Cross-lingual Transfer to Low-resource Languages

It is shown that languages written in non-Latin and non-alphabetic scripts (mostly Asian languages) are the best choices for improving performance on the task of Masked Language Modelling (MLM) in a diverse set of 30 low-resource languages, and that the success of the transfer is well predicted by a novel measure of Subword Evenness (SuE).

Analyzing BERT Cross-lingual Transfer Capabilities in Continual Sequence Labeling

It is found that lost performance can be recovered with as little as a single training epoch even if forgetting was high, which can be explained by a progressive shift of model parameters towards a better multilingual initialization.

XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalization

The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is introduced, a multi-task benchmark for evaluating the cross-lingual generalization capabilities of multilingual representations across 40 languages and 9 tasks.

Cross-lingual Language Model Pretraining

This work proposes two methods to learn cross-lingual language models (XLMs): one unsupervised that only relies on monolingual data, and one supervised that leverages parallel data with a new cross-lingual language model objective.

XLDA: Cross-Lingual Data Augmentation for Natural Language Inference and Question Answering

While natural language processing systems often focus on a single language, multilingual transfer learning has the potential to improve performance, especially for low-resource languages.

Emerging Cross-lingual Structure in Pretrained Language Models

It is shown that transfer is possible even when there is no shared vocabulary across the monolingual corpora and also when the text comes from very different domains, and it is strongly suggested that, much like for non-contextual word embeddings, there are universal latent symmetries in the learned embedding spaces.
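The “universal latent symmetries” finding is typically probed by aligning two embedding spaces with an orthogonal map. In 2-D the orthogonal Procrustes problem has a closed-form angle, which makes for a compact toy illustration; real embedding spaces use the SVD-based solution in higher dimensions:

```python
import math

def align_rotation_2d(X, Y):
    """Closed-form 2-D orthogonal Procrustes: the rotation angle that best
    maps point set X onto Y in the least-squares sense."""
    s = sum(x[0] * y[1] - x[1] * y[0] for x, y in zip(X, Y))
    c = sum(x[0] * y[0] + x[1] * y[1] for x, y in zip(X, Y))
    return math.atan2(s, c)
```

If Y really is a rotated copy of X, the angle is recovered exactly; the degree of residual error after alignment is one way to quantify how “symmetric” two learned spaces are.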

Unicoder: A Universal Language Encoder by Pre-training with Multiple Cross-lingual Tasks

It is found that fine-tuning on multiple languages together brings further improvement to Unicoder, a universal language encoder that is insensitive to language differences.

Evaluating the Cross-Lingual Effectiveness of Massively Multilingual Neural Machine Translation

Evaluating the cross-lingual effectiveness of representations from the encoder of a massively multilingual NMT model on 5 downstream classification and sequence labeling tasks covering a diverse set of over 50 languages shows gains in zero-shot transfer in 4 out of 5 tasks.

MLQA: Evaluating Cross-lingual Extractive Question Answering

This work presents MLQA, a multi-way aligned extractive QA evaluation benchmark intended to spur research in this area, and evaluates state-of-the-art cross-lingual models and machine-translation-based baselines on MLQA.

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

This systematic study compares pre-training objectives, architectures, unlabeled datasets, transfer approaches, and other factors on dozens of language understanding tasks and achieves state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more.

Language Models are Unsupervised Multitask Learners

It is demonstrated that language models begin to learn these tasks without any explicit supervision when trained on a new dataset of millions of webpages called WebText, suggesting a promising path towards building language processing systems which learn to perform tasks from their naturally occurring demonstrations.

GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding

A benchmark of nine diverse NLU tasks, an auxiliary dataset for probing models for understanding of specific linguistic phenomena, and an online platform for evaluating and comparing models, which favors models that can represent linguistic knowledge in a way that facilitates sample-efficient learning and effective knowledge-transfer across tasks.

Improving Language Understanding by Generative Pre-Training

The general task-agnostic model outperforms discriminatively trained models that use architectures specifically crafted for each task, improving upon the state of the art in 9 out of the 12 tasks studied.