Lifting the Curse of Multilinguality by Pre-training Modular Transformers

Jonas Pfeiffer, Naman Goyal, Xi Victoria Lin, Xian Li, James Cross, Sebastian Riedel, Mikel Artetxe
Multilingual pre-trained models are known to suffer from the curse of multilinguality, which causes per-language performance to drop as they cover more languages. We address this issue by introducing language-specific modules, which allows us to grow the total capacity of the model, while keeping the total number of trainable parameters per language constant. In contrast with prior work that learns language-specific components post-hoc, we pre-train the modules of our Cross-lingual Modular (X… 

BabelBERT: Massively Multilingual Transformers Meet a Massively Multilingual Lexical Resource

This work exposes massively multilingual transformers (MMTs, e.g., mBERT or XLM-R) to multilingual lexical knowledge at scale and demonstrates that the pretraining quality of word representations in the MMT for languages involved in specialization has a much larger effect on performance than the linguistic diversity of the set of constraints.

Phylogeny-Inspired Adaptation of Multilingual Models to New Languages

This study shows how language phylogenetic information can be used to improve cross-lingual transfer leveraging closely related languages in a structured, linguistically-informed manner.

Branch-Train-Merge: Embarrassingly Parallel Training of Expert Language Models

BTM improves in- and out-of-domain perplexities as compared to GPT-style Transformer LMs, and gains grow with the number of domains, suggesting more aggressive parallelism could be used to efficiently train larger models in future work.

Language Modelling with Pixels

PIXEL is a pretrained language model that renders text as images, making it possible to transfer representations across languages based on orthographic similarity or the co-activation of pixels, and is more robust to noisy text inputs than BERT, further confirming the benefits of modelling language with pixels.

Crossing the Conversational Chasm: A Primer on Natural Language Processing for Multilingual Task-Oriented Dialogue Systems

An extensive overview of existing methods and resources in multilingual ToD is provided as an entry point to this exciting and emerging field, and parallels are drawn between components of the ToD pipeline and other NLP tasks, which can inspire solutions for learning in low-resource scenarios.

Pile of Law: Learning Responsible Data Filtering from the Law and a 256GB Open-Source Legal Dataset

One concern with the rise of large language models lies with their potential for significant harm, particularly from pretraining on biased, obscene, copyrighted, and private information. Emerging

Language-Family Adapters for Multilingual Neural Machine Translation

Massively multilingual models pretrained on abundant corpora with self-supervision achieve state-of-the-art results in a wide range of natural language processing tasks. In machine translation,



On the Cross-lingual Transferability of Monolingual Representations

This work designs an alternative approach that transfers a monolingual model to new languages at the lexical level and shows that it is competitive with multilingual BERT on standard cross-lingual classification benchmarks and on a new Cross-lingual Question Answering Dataset (XQuAD).

MAD-X: An Adapter-based Framework for Multi-task Cross-lingual Transfer

MAD-X, an adapter-based framework that enables high portability and parameter-efficient transfer to arbitrary tasks and languages by learning modular language and task representations, is proposed, along with a novel invertible adapter architecture and a strong baseline method for adapting a pretrained multilingual model to a new language.

Emerging Cross-lingual Structure in Pretrained Language Models

It is shown that transfer is possible even when there is no shared vocabulary across the monolingual corpora and also when the text comes from very different domains, and it is strongly suggested that, much like for non-contextual word embeddings, there are universal latent symmetries in the learned embedding spaces.

From English To Foreign Languages: Transferring Pre-trained Language Models

This work tackles the problem of transferring an existing pre-trained model from English to other languages under a limited computational budget and demonstrates that its models are better than multilingual BERT on two zero-shot tasks: natural language inference and dependency parsing.

MAD-G: Multilingual Adapter Generation for Efficient Cross-Lingual Transfer

MAD-G (Multilingual ADapter Generation), which contextually generates language adapters from language representations based on typological features, offers substantial benefits for low-resource languages, particularly on the NER task in low-resource African languages.

Rethinking embedding coupling in pre-trained language models

The analysis shows that larger output embeddings prevent the model's last layers from overspecializing to the pre-training task and encourage Transformer representations to be more general and more transferable to other tasks and languages.

Improving Multilingual Models with Language-Clustered Vocabularies

This work introduces a novel procedure for multilingual vocabulary generation that combines the separately trained vocabularies of several automatically derived language clusters, thus balancing the trade-off between cross-lingual subword sharing and language-specific vocabularies.

From Zero to Hero: On the Limitations of Zero-Shot Language Transfer with Multilingual Transformers

It is demonstrated that the inexpensive few-shot transfer (i.e., additional fine-tuning on a few target-language instances) is surprisingly effective across the board, warranting more research efforts reaching beyond the limiting zero-shot conditions.

Orthogonal Language and Task Adapters in Zero-Shot Cross-Lingual Transfer

This work proposes orthogonal language and task adapters (dubbed orthoadapters) for cross-lingual transfer that are trained to encode language- and task-specific information that is complementary to the knowledge already stored in the pretrained transformer's parameters.

When Being Unseen from mBERT is just the Beginning: Handling New Languages With Multilingual Language Models

It is shown that transliterating unseen languages significantly improves the potential of large-scale multilingual language models on downstream tasks and provides a promising direction towards making these massively multilingual models useful for a new set of unseen languages.