Corpus ID: 235727612

A Primer on Pretrained Multilingual Language Models

Sumanth Doddapaneni, Gowtham Ramesh, Anoop Kunchukuttan, Pratyush Kumar, Mitesh M. Khapra
Multilingual Language Models (MLLMs) such as mBERT, XLM, and XLM-R have emerged as a viable option for bringing the power of pretraining to a large number of languages. Given their success in zero-shot transfer learning, a large body of work has emerged on (i) building bigger MLLMs covering more languages, (ii) creating exhaustive benchmarks covering a wider variety of tasks and languages for evaluating MLLMs, and (iii) analysing the performance of MLLMs on monolingual and zero-shot…
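The zero-shot transfer the abstract refers to means fine-tuning on labelled data in one language and evaluating on others. A toy sketch of why this can work (the shared-trigram space, the example sentences, and the centroid classifier are all illustrative stand-ins, not the models discussed): an MLLM maps text from many languages into one shared representation space, so a classifier fitted on English labels alone can score text in another language.

```python
# Toy illustration of zero-shot cross-lingual transfer: a bag of
# character trigrams stands in for an MLLM's shared subword space.
from collections import Counter

def subwords(text, n=3):
    """Character trigrams as a stand-in for a shared subword vocabulary."""
    t = f"#{text.lower()}#"
    return [t[i:i + n] for i in range(len(t) - n + 1)]

# English training data only (label 1 = positive, 0 = negative).
train = [("excellent film", 1), ("fantastic idea", 1),
         ("terrible film", 0), ("horrible idea", 0)]

# Fit per-class subword counts (a naive centroid classifier).
centroids = {0: Counter(), 1: Counter()}
for text, label in train:
    centroids[label].update(subwords(text))

def predict(text):
    feats = Counter(subwords(text))
    scores = {c: sum(feats[s] * cnt for s, cnt in centroids[c].items())
              for c in centroids}
    return max(scores, key=scores.get)

# Zero-shot: Spanish test sentences, never seen during training.
# Cognates ("excelente"/"excellent", "horrible"/"horrible") share
# subwords with English, so the English-only classifier transfers.
print(predict("pelicula excelente"))
print(predict("idea horrible"))
```

Real MLLMs transfer through learned contextual representations rather than surface subword overlap, but the mechanism sketched here, shared structure between languages, is one ingredient the analysis literature studies.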


mLUKE: The Power of Entity Representations in Multilingual Pretrained Language Models
This study trains a multilingual language model with 24 languages with entity representations and shows the model consistently outperforms word-based pretrained models in various cross-lingual transfer tasks.


Unsupervised Cross-lingual Representation Learning at Scale
It is shown that pretraining multilingual language models at scale leads to significant performance gains for a wide range of cross-lingual transfer tasks, and the possibility of multilingual modeling without sacrificing per-language performance is shown for the first time.
Multilingual is not enough: BERT for Finnish
While the multilingual model largely fails to reach the performance of previously proposed methods, the custom Finnish BERT model establishes new state-of-the-art results on all corpora for all reference tasks: part-of-speech tagging, named entity recognition, and dependency parsing.
Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond
An architecture to learn joint multilingual sentence representations for 93 languages, belonging to more than 30 different families and written in 28 different scripts, using a single BiLSTM encoder with a shared byte-pair encoding vocabulary for all languages, coupled with an auxiliary decoder and trained on publicly available parallel corpora.
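The shared byte-pair-encoding vocabulary mentioned above can be sketched with the standard greedy BPE merge procedure (this is a minimal illustration, not the paper's implementation; the mixed English/German corpus and merge count are made up). Training BPE over text from several languages at once means frequent subwords that recur across languages end up shared between them.

```python
# Minimal BPE: repeatedly merge the most frequent adjacent symbol pair
# over a word list, yielding a single subword vocabulary for all inputs.
from collections import Counter

def learn_bpe(words, num_merges):
    """Return the list of learned merges, most frequent first."""
    vocab = Counter(tuple(w) + ("</w>",) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged = {}
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[tuple(out)] = merged.get(tuple(out), 0) + freq
        vocab = merged
    return merges

# Mixed English/German word list: cognates let the languages share merges,
# e.g. "er" and "ter" are learned from "water", "winter", and "wasser" alike.
corpus = "water winter water wasser winter water winter".split()
print(learn_bpe(corpus, 5))
```

The first merges come from substrings frequent across both languages, which is exactly what makes a single shared vocabulary viable for a massively multilingual encoder.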
iNLPSuite: Monolingual Corpora, Evaluation Benchmarks and Pre-trained Multilingual Language Models for Indian Languages
This paper introduces NLP resources for 11 major Indian languages from two major language families, and creates datasets for the following tasks: Article Genre Classification, Headline Prediction, Wikipedia Section-Title Prediction, Cloze-style Multiple choice QA, Winograd NLI and COPA.
MLQA: Evaluating Cross-lingual Extractive Question Answering
This work presents MLQA, a multi-way aligned extractive QA evaluation benchmark intended to spur research in this area, and evaluates state-of-the-art cross-lingual models and machine-translation-based baselines on MLQA.
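Extractive QA benchmarks like MLQA score predicted answer spans with SQuAD-style metrics. A minimal sketch of the token-overlap F1 (whitespace tokenization only; the real evaluation also normalizes punctuation and articles and adapts tokenization for languages written without whitespace, all omitted here):

```python
# Token-overlap F1 between a predicted answer span and a gold answer.
from collections import Counter

def f1_score(prediction, gold):
    pred, ref = prediction.split(), gold.split()
    common = Counter(pred) & Counter(ref)   # multiset intersection
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

# A prediction with one spurious token: precision 2/3, recall 1, F1 0.8.
print(f1_score("the eiffel tower", "eiffel tower"))
```

Partial credit for overlapping spans is what makes F1 more forgiving than exact match when answer boundaries are ambiguous across languages.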
Unicoder: A Universal Language Encoder by Pre-training with Multiple Cross-lingual Tasks
It is found that doing fine-tuning on multiple languages together can bring further improvement in Unicoder, a universal language encoder that is insensitive to different languages.
Adaptation of Deep Bidirectional Transformers for Afrikaans Language
The results show that AfriBERT improves the current state-of-the-art in most of the tasks the authors considered, and that transfer learning from multilingual to monolingual model can have a significant performance improvement on downstream tasks.
Improving Multilingual Models with Language-Clustered Vocabularies
This work introduces a novel procedure for multilingual vocabulary generation that combines the separately trained vocabularies of several automatically derived language clusters, thus balancing the trade-off between cross-lingual subword sharing and language-specific vocabularies.
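A sketch of the idea behind language-clustered vocabularies (the trigram features, Jaccard threshold, greedy clustering, and tiny corpora are all illustrative assumptions, not the paper's procedure): group languages whose subword distributions overlap, then build one vocabulary per cluster instead of a single vocabulary shared by all languages.

```python
# Cluster languages by surface similarity, as a proxy for clustering by
# learned subword distributions as in the cited work.

def trigrams(text):
    t = text.replace(" ", "_")
    return {t[i:i + 3] for i in range(len(t) - 2)}

def jaccard(a, b):
    return len(a & b) / len(a | b)

# Tiny stand-in corpora; real systems use far more text per language.
corpora = {
    "en": "the winter water is cold",
    "de": "das wasser im winter ist kalt",
    "hi": "sardi ka pani thanda hai",
    "ur": "sardi ka pani thanda hota hai",
}
feats = {lang: trigrams(text) for lang, text in corpora.items()}

# Greedy single-link clustering with a similarity threshold.
clusters = []
for lang in corpora:
    for cluster in clusters:
        if any(jaccard(feats[lang], feats[m]) > 0.2 for m in cluster):
            cluster.append(lang)
            break
    else:
        clusters.append([lang])

print(clusters)  # related languages land in the same cluster
```

Each resulting cluster would then get its own separately trained vocabulary, preserving subword sharing where it helps (related languages) without forcing unrelated scripts to compete for the same vocabulary budget.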
Emerging Cross-lingual Structure in Pretrained Language Models
It is shown that transfer is possible even when there is no shared vocabulary across the monolingual corpora and also when the text comes from very different domains, and it is strongly suggested that, much like for non-contextual word embeddings, there are universal latent symmetries in the learned embedding spaces. Expand
Cross-lingual Language Model Pretraining
This work proposes two methods to learn cross-lingual language models (XLMs): one unsupervised that only relies on monolingual data, and one supervised that leverages parallel data with a new cross-lingual language model objective.
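The two objectives can be sketched side by side (a simplified illustration with assumed details such as the masking rate, sentence pair, and separator token; real XLM also replaces some masked positions with random or unchanged tokens): masked language modeling (MLM) masks tokens within a monolingual sentence, while translation language modeling (TLM) masks over the concatenation of a parallel sentence pair, so the model can consult the translation when recovering a masked token.

```python
# MLM vs. TLM masking, sketched on token lists.
import random

MASK = "[MASK]"

def mask_tokens(tokens, rate=0.15, rng=None):
    """Replace ~rate of the tokens with [MASK]; return inputs and targets."""
    rng = rng or random.Random(0)   # fixed seed for a reproducible demo
    inputs, targets = [], []
    for tok in tokens:
        if tok != "</s>" and rng.random() < rate:
            inputs.append(MASK)
            targets.append(tok)     # the model must recover this token
        else:
            inputs.append(tok)
            targets.append(None)    # not a prediction target
    return inputs, targets

en = "the cat sleeps".split()
fr = "le chat dort".split()

# MLM: mask within a single monolingual sentence.
mlm_inputs, _ = mask_tokens(en, rate=0.45)

# TLM: mask over the concatenated parallel pair; a token masked on one
# side can often be predicted from its translation on the other side.
tlm_inputs, _ = mask_tokens(en + ["</s>"] + fr, rate=0.45)
print(mlm_inputs)
print(tlm_inputs)
```

The supervision in TLM comes only from the sentence alignment itself; no word-level alignments are needed, which is why parallel corpora suffice.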