Lifting the Curse of Multilinguality by Pre-training Modular Transformers
Jonas Pfeiffer, Naman Goyal, Xi Victoria Lin, Xian Li, James Cross, Sebastian Riedel and Mikel Artetxe
Multilingual pre-trained models are known to suffer from the curse of multilinguality, which causes per-language performance to drop as they cover more languages. We address this issue by introducing language-specific modules, which allow us to grow the total capacity of the model while keeping the total number of trainable parameters per language constant. In contrast with prior work that learns language-specific components post-hoc, we pre-train the modules of our Cross-lingual Modular (X…
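The modular idea above can be sketched in miniature: a shared backbone plus one module per language, selected by language ID, so total capacity grows with the number of languages while each forward pass only touches one language's parameters. All names and transformations below are illustrative stand-ins, not the paper's actual architecture.

```python
def make_language_module(scale):
    """Stand-in for a language-specific feed-forward block."""
    def module(h):
        return [x * scale for x in h]
    return module

# One module per language: capacity grows with languages, but the
# per-language trainable parameter count stays constant.
modules = {
    "en": make_language_module(1.0),
    "de": make_language_module(0.5),
}

def shared_backbone(h):
    """Stand-in for the shared (language-agnostic) transformer layers."""
    return [x + 1.0 for x in h]

def forward(h, lang):
    # Route through the shared backbone, then the language's own module.
    return modules[lang](shared_backbone(h))

print(forward([1.0, 2.0], "de"))  # -> [1.0, 1.5]
```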
BabelBERT: Massively Multilingual Transformers Meet a Massively Multilingual Lexical Resource
This work exposes massively multilingual transformers (MMTs, e.g., mBERT or XLM-R) to multilingual lexical knowledge at scale and demonstrates that the pretraining quality of word representations in the MMT for languages involved in specialization has a much larger effect on performance than the linguistic diversity of the set of constraints.
Phylogeny-Inspired Adaptation of Multilingual Models to New Languages
This study shows how language phylogenetic information can be used to improve cross-lingual transfer by leveraging closely related languages in a structured, linguistically informed manner.
Language Modelling with Pixels
PIXEL is a pretrained language model that renders text as images, making it possible to transfer representations across languages based on orthographic similarity or the co-activation of pixels, and is more robust to noisy text inputs than BERT, further confirming the benefits of modelling language with pixels.
Crossing the Conversational Chasm: A Primer on Natural Language Processing for Multilingual Task-Oriented Dialogue Systems
An extensive overview of existing methods and resources in multilingual ToD is provided as an entry point to this exciting and emerging field, drawing parallels between components of the ToD pipeline and other NLP tasks that can inspire solutions for learning in low-resource scenarios.
Pile of Law: Learning Responsible Data Filtering from the Law and a 256GB Open-Source Legal Dataset
One concern with the rise of large language models lies in their potential for significant harm, particularly from pretraining on biased, obscene, copyrighted, and private information.
On the Cross-lingual Transferability of Monolingual Representations
This work designs an alternative approach that transfers a monolingual model to new languages at the lexical level and shows that it is competitive with multilingual BERT on standard cross-lingual classification benchmarks and on a new Cross-lingual Question Answering Dataset (XQuAD).
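Lexical-level transfer of the kind summarized above can be sketched as keeping the transformer body fixed and swapping in a new embedding table for the target language. Everything here is a toy stand-in for the actual procedure, with a sum playing the role of the frozen body.

```python
def body(h):
    # Frozen "transformer body": here just a sum, as a placeholder.
    return sum(h)

def forward(embeddings, body, token_ids):
    # Look up each token in the language's lexical layer, then run the
    # shared, unchanged body on top.
    h = [embeddings[t] for t in token_ids]
    return body(h)

en_embeddings = {0: 1.0, 1: 2.0}  # trained with the original English model
sw_embeddings = {0: 0.5, 1: 1.5}  # newly learned for a target language

# Same body, different lexical layer:
print(forward(en_embeddings, body, [0, 1]))  # -> 3.0
print(forward(sw_embeddings, body, [0, 1]))  # -> 2.0
```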
MAD-X: An Adapter-based Framework for Multi-task Cross-lingual Transfer
MAD-X is proposed, an adapter-based framework that enables high portability and parameter-efficient transfer to arbitrary tasks and languages by learning modular language and task representations and introduces a novel invertible adapter architecture and a strong baseline method for adapting a pretrained multilingual model to a new language.
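The building block MAD-X stacks for languages and tasks is the bottleneck adapter: a down-projection, a nonlinearity, an up-projection, and a residual connection inserted into a frozen transformer. The sketch below uses toy dimensions and hand-picked weights purely for illustration.

```python
def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def relu(x):
    return [max(0.0, xi) for xi in x]

def adapter(h, W_down, W_up):
    # h -> down-project -> ReLU -> up-project -> add residual
    z = relu(matvec(W_down, h))
    out = matvec(W_up, z)
    return [hi + oi for hi, oi in zip(h, out)]

# Toy 4 -> 2 -> 4 bottleneck: only W_down and W_up are trained,
# which is what makes adapter-based transfer parameter-efficient.
W_down = [[1.0, 0.0, 0.0, 0.0],
          [0.0, 1.0, 0.0, 0.0]]
W_up = [[0.5, 0.0],
        [0.0, 0.5],
        [0.0, 0.0],
        [0.0, 0.0]]

h = [1.0, -2.0, 3.0, 4.0]
print(adapter(h, W_down, W_up))  # -> [1.5, -2.0, 3.0, 4.0]
```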
UNKs Everywhere: Adapting Multilingual Language Models to New Scripts
This work proposes a series of novel data-efficient methods that enable quick and effective adaptation of pretrained multilingual models to such low-resource languages and unseen scripts and demonstrates that they can yield improvements for low-resource languages written in scripts covered by the pretrained model.
Emerging Cross-lingual Structure in Pretrained Language Models
It is shown that transfer is possible even when there is no shared vocabulary across the monolingual corpora and also when the text comes from very different domains, and it is strongly suggested that, much like for non-contextual word embeddings, there are universal latent symmetries in the learned embedding spaces.
From English To Foreign Languages: Transferring Pre-trained Language Models
This work tackles the problem of transferring an existing pre-trained model from English to other languages under a limited computational budget and demonstrates that its models are better than multilingual BERT on two zero-shot tasks: natural language inference and dependency parsing.
MAD-G: Multilingual Adapter Generation for Efficient Cross-Lingual Transfer
MAD-G (Multilingual ADapter Generation) contextually generates language adapters from language representations based on typological features, and offers substantial benefits for low-resource languages, particularly on the NER task in low-resource African languages.
Rethinking embedding coupling in pre-trained language models
The analysis shows that larger output embeddings prevent the model's last layers from overspecializing to the pre-training task and encourage Transformer representations to be more general and more transferable to other tasks and languages.
Improving Multilingual Models with Language-Clustered Vocabularies
This work introduces a novel procedure for multilingual vocabulary generation that combines the separately trained vocabularies of several automatically derived language clusters, thus balancing the trade-off between cross-lingual subword sharing and language-specific vocabularies.
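The clustering intuition can be sketched in miniature: group languages whose subword inventories overlap, then build one vocabulary per cluster. The real procedure derives clusters automatically from learned language representations; here Jaccard overlap of tiny token sets serves as a hypothetical stand-in.

```python
def jaccard(a, b):
    """Overlap between two token sets: |A ∩ B| / |A ∪ B|."""
    return len(a & b) / len(a | b)

tokens = {
    "es": {"el", "la", "de", "que"},
    "pt": {"o", "a", "de", "que"},
    "fi": {"ja", "on", "ei", "se"},
}

def cluster(tokens, threshold=0.3):
    """Greedy single-link clustering with a similarity threshold."""
    clusters = []
    for lang, toks in tokens.items():
        for c in clusters:
            if any(jaccard(toks, tokens[m]) >= threshold for m in c):
                c.append(lang)
                break
        else:
            clusters.append([lang])
    return clusters

# Related languages land in one cluster and would share a vocabulary;
# unrelated ones get their own.
print(cluster(tokens))  # -> [['es', 'pt'], ['fi']]
```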
From Zero to Hero: On the Limitations of Zero-Shot Language Transfer with Multilingual Transformers
It is demonstrated that the inexpensive few-shot transfer (i.e., additional fine-tuning on a few target-language instances) is surprisingly effective across the board, warranting more research efforts reaching beyond the limiting zero-shot conditions.
Orthogonal Language and Task Adapters in Zero-Shot Cross-Lingual Transfer
This work proposes orthogonal language and task adapters (dubbed orthoadapters) for cross-lingual transfer that are trained to encode language- and task-specific information that is complementary to the knowledge already stored in the pretrained transformer's parameters.