XNLI: Evaluating Cross-lingual Sentence Representations

@inproceedings{conneau2018xnli,
  title={XNLI: Evaluating Cross-lingual Sentence Representations},
  author={Alexis Conneau and Guillaume Lample and Ruty Rinott and Adina Williams and Samuel R. Bowman and Holger Schwenk and Veselin Stoyanov},
  booktitle={Conference on Empirical Methods in Natural Language Processing},
  year={2018}
}
State-of-the-art natural language processing systems rely on supervision in the form of annotated data to learn competent models. [...] In addition, we provide several baselines for multilingual sentence understanding, including two based on machine translation systems, and two that use parallel data to train aligned multilingual bag-of-words and LSTM encoders. We find that XNLI represents a practical and challenging evaluation suite, and that directly translating the test data yields the best…
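
For orientation, an XNLI-style example pairs a premise with a hypothesis under a three-way label (entailment, contradiction, or neutral), in parallel across the benchmark's languages. A minimal illustrative record (the sentences below are invented for illustration, not drawn from the corpus):

```python
# Illustrative XNLI-style record (hypothetical data, not from the corpus):
# each example is a premise-hypothesis pair with one of three labels.
example = {
    "language": "en",
    "premise": "A man is playing a guitar on stage.",
    "hypothesis": "Someone is making music.",
    "label": "entailment",  # or "contradiction" / "neutral"
}

print(example["label"])  # entailment
```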

XLDA: Cross-Lingual Data Augmentation for Natural Language Inference and Question Answering

While natural language processing systems often focus on a single language, multilingual transfer learning has the potential to improve performance, especially for low-resource languages.

Cross-lingual Transfer Learning for Multilingual Task Oriented Dialog

This paper presents a new data set of 57k annotated utterances in English, Spanish, and Thai, and uses it to evaluate three different cross-lingual transfer methods, finding that, given several hundred training examples in the target language, the latter two methods outperform translating the training data.

Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond

An architecture to learn joint multilingual sentence representations for 93 languages, belonging to more than 30 different families and written in 28 different scripts using a single BiLSTM encoder with a shared byte-pair encoding vocabulary for all languages, coupled with an auxiliary decoder and trained on publicly available parallel corpora.

On Learning Universal Representations Across Languages

Hierarchical Contrastive Learning (HiCTL) is proposed to learn universal representations for parallel sentences distributed in one or multiple languages and distinguish the semantically-related words from a shared cross-lingual vocabulary for each sentence.

Cross-Lingual Natural Language Generation via Pre-Training

Experimental results on question generation and abstractive summarization show that the model outperforms the machine-translation-based pipeline methods for zero-shot cross-lingual generation and improves NLG performance of low-resource languages by leveraging rich-resource language data.

A Deep Transfer Learning Method for Cross-Lingual Natural Language Inference

It is argued that the transfer learning-based loss objective is model agnostic and thus can be used with other deep learning-based architectures for cross-lingual NLI.

Probing Multilingual Language Models for Discourse

It is found that the XLM-RoBERTa family of models consistently shows the best performance, simultaneously being good monolingual models and degrading relatively little in a zero-shot setting.

Cross-lingual Language Model Pretraining

This work proposes two methods to learn cross-lingual language models (XLMs): one unsupervised that only relies on monolingual data, and one supervised that leverages parallel data with a new cross-lingual language model objective.

Evaluating Multilingual Text Encoders for Unsupervised Cross-Lingual Retrieval

This work presents a systematic empirical study of the suitability of state-of-the-art multilingual encoders for cross-lingual document and sentence retrieval across a large number of language pairs, and indicates that for unsupervised document-level CLIR (a setup in which there are no relevance judgments for task-specific fine-tuning) the pretrained encoders fail to significantly outperform models based on CLWEs.

ParsiNLU: A Suite of Language Understanding Challenges for Persian

This work introduces ParsiNLU, the first benchmark for the Persian language covering a range of language understanding tasks (reading comprehension, textual entailment, and more), presents the first results of state-of-the-art monolingual and multilingual pre-trained language models on this benchmark, and compares them with human performance.

Baselines and Test Data for Cross-Lingual Inference

This paper proposes to advance research in SNLI-style natural language inference toward multilingual evaluation, providing test data for four major languages (Arabic, French, Spanish, and Russian) along with baselines based on cross-lingual word embeddings and machine translation.

Word Translation Without Parallel Data

It is shown that a bilingual dictionary can be built between two languages without using any parallel corpora, by aligning monolingual word embedding spaces in an unsupervised way.
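
A standard building block in such alignment pipelines is the orthogonal Procrustes solve, which this line of work applies during its refinement step once candidate word pairs exist; a minimal NumPy sketch, assuming the rows of `X` and `Y` are already-matched word vectors:

```python
import numpy as np

def procrustes(X: np.ndarray, Y: np.ndarray) -> np.ndarray:
    """Return the orthogonal matrix W minimizing ||X @ W - Y||_F."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

# Synthetic check: recover a known rotation from matched vector pairs.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))                  # 100 "source" word vectors
R, _ = np.linalg.qr(rng.normal(size=(50, 50)))  # ground-truth rotation
Y = X @ R                                       # "target" vectors
W = procrustes(X, Y)
print(np.allclose(X @ W, Y))  # True
```

Note that the paper's contribution is obtaining the matched pairs without supervision (adversarial training plus iterative refinement); this sketch shows only the alignment solve once pairs are available.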

Phrase-Based & Neural Unsupervised Machine Translation

This work investigates how to learn to translate when having access to only large monolingual corpora in each language, and proposes two model variants, a neural and a phrase-based model, which are significantly better than methods from the literature, while being simpler and having fewer hyper-parameters.

A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference

The Multi-Genre Natural Language Inference corpus is introduced, a dataset designed for use in the development and evaluation of machine learning models for sentence understanding and shows that it represents a substantially more difficult task than does the Stanford NLI corpus.

An Empirical Analysis of NMT-Derived Interlingual Embeddings and Their Use in Parallel Sentence Identification

This work systematically studies neural machine translation context vectors, i.e., the output of the encoder, and their power as an interlingual representation of a sentence, assessing their quality and effectiveness by measuring similarities across translations, as well as semantically related and semantically unrelated sentence pairs.

A Corpus for Multilingual Document Classification in Eight Languages

A new subset of the Reuters corpus with balanced class priors for eight languages is proposed, adding Italian, Russian, Japanese, and Chinese, and strong baselines are provided for all language transfer directions using multilingual word and sentence embeddings.

Google’s Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation

This work proposes a simple solution to use a single Neural Machine Translation (NMT) model to translate between multiple languages using a shared wordpiece vocabulary, and introduces an artificial token at the beginning of the input sentence to specify the required target language.
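
The artificial-token mechanism is simple enough to sketch; assuming the paper's `<2xx>` token format (e.g. `<2es>` to request Spanish output), input preparation amounts to:

```python
def prepare_input(sentence: str, target_lang: str) -> str:
    """Prepend the artificial target-language token so a single shared
    multilingual NMT model knows which language to translate into."""
    return f"<2{target_lang}> {sentence}"

print(prepare_input("How are you?", "es"))  # <2es> How are you?
```

The model itself is an otherwise standard NMT architecture with a shared wordpiece vocabulary; the token alone routes the translation direction, which is what makes zero-shot directions possible.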

Unsupervised Machine Translation Using Monolingual Corpora Only

This work proposes a model that takes sentences from monolingual corpora in two different languages and maps them into the same latent space and effectively learns to translate without using any labeled data.

Multilingual Deep Learning

A cross-language learning framework, with origins in deep learning, that learns a shared deep representation for two languages, representing data from both languages in a common space, and is comparable to state-of-the-art approaches for these tasks.

Improving Language Understanding by Generative Pre-Training

The general task-agnostic model outperforms discriminatively trained models that use architectures specifically crafted for each task, improving upon the state of the art in 9 out of the 12 tasks studied.