Intriguing Properties of Compression on Multilingual Models

Kelechi Ogueji, Orevaoghene Ahia, Gbemileke Onilude, Sebastian Gehrmann, Sara Hooker, Julia Kreutzer
Multilingual models are often particularly dependent on scaling to generalize to a growing number of languages. Compression techniques are widely relied upon to reconcile the growth in model size with real-world resource constraints, but compression can have a disparate effect on model performance for low-resource languages. It is thus crucial to understand the trade-offs between scale, multilingualism, and compression. In this work, we propose an experimental framework to characterize the…



Learning Compact Metrics for MT

This work investigates the trade-off between multilinguality and model capacity with RemBERT, a state-of-the-art multilingual language model, using data from the WMT Metrics Shared Task. It demonstrates how distillation can help address this bottleneck by leveraging synthetic data generation and transferring knowledge from one teacher to multiple students trained on related languages.

Load What You Need: Smaller Versions of Multilingual BERT

This paper proposes extracting smaller models that handle fewer languages, chosen according to the targeted corpora. Results confirm that such models keep comparable performance while reducing the total number of parameters by up to 45%.
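
One way to picture this kind of extraction is vocabulary trimming: keep only the embedding rows for tokens that occur in the target corpus. The sketch below is a toy illustration under that assumption, not the paper's actual procedure, which the summary above does not describe. The function names, the toy vocabulary, and the parameter counts are all hypothetical.

```python
def trim_vocabulary(vocab, embeddings, corpus_tokens,
                    special_tokens=("[PAD]", "[UNK]")):
    """Keep only the embedding rows for tokens seen in the target corpus.

    vocab:       dict mapping token -> row index in `embeddings`
    embeddings:  list of rows (one list of floats per token)
    Hypothetical helper for illustration only.
    """
    keep = set(corpus_tokens) | set(special_tokens)
    # Walk the vocabulary in original id order so retained tokens
    # keep their relative ordering after re-indexing.
    ordered = sorted(vocab.items(), key=lambda kv: kv[1])
    kept_tokens = [tok for tok, _ in ordered if tok in keep]
    new_vocab = {tok: i for i, tok in enumerate(kept_tokens)}
    new_embeddings = [embeddings[vocab[tok]] for tok in kept_tokens]
    return new_vocab, new_embeddings


def parameter_reduction(old_rows, new_rows, hidden_size, other_params):
    """Fraction of total parameters removed by shrinking the embedding table."""
    old_total = old_rows * hidden_size + other_params
    new_total = new_rows * hidden_size + other_params
    return 1.0 - new_total / old_total


# Toy example: a six-token vocabulary trimmed to a French-only corpus.
vocab = {"[PAD]": 0, "[UNK]": 1, "hello": 2, "bonjour": 3, "hola": 4, "monde": 5}
embeddings = [[float(i)] * 4 for i in range(len(vocab))]
corpus_tokens = ["bonjour", "monde", "bonjour"]

new_vocab, new_embeddings = trim_vocabulary(vocab, embeddings, corpus_tokens)
reduction = parameter_reduction(len(vocab), len(new_vocab),
                                hidden_size=4, other_params=76)
```

In this toy setting the saving is small because the embedding table is a small share of the made-up parameter count. In a real multilingual model the share of parameters held by the embedding table, and therefore the achievable reduction, depends on the vocabulary and architecture.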

When Being Unseen from mBERT is just the Beginning: Handling New Languages With Multilingual Language Models

It is shown that transliterating unseen languages significantly improves the potential of large-scale multilingual language models on downstream tasks and provides a promising direction towards making these massively multilingual models useful for a new set of unseen languages.

Unsupervised Cross-lingual Representation Learning at Scale

It is shown that pretraining multilingual language models at scale leads to significant performance gains for a wide range of cross-lingual transfer tasks, and the possibility of multilingual modeling without sacrificing per-language performance is demonstrated for the first time.

Small Data? No Problem! Exploring the Viability of Pretrained Multilingual Language Models for Low-resourced Languages

It is shown that it is possible to train competitive multilingual language models on less than 1 GB of text and results suggest that the “small data” approach based on similar languages may sometimes work better than joint training on large datasets with high-resource languages.

The Low-Resource Double Bind: An Empirical Study of Pruning for Low-Resource Machine Translation

It is suggested that sparsity can play a beneficial role in curbing memorization of low-frequency attributes, and therefore offers a promising solution to the low-resource double bind.

On the Prunability of Attention Heads in Multilingual BERT

This work employs pruning to quantify the robustness and interpret layer-wise importance of mBERT, finding that the importance of the encoder layers sensitively depends on the language family and the pre-training corpus size.

Extending Multilingual BERT to Low-Resource Languages

This paper proposes a simple but effective approach to extend M-BERT (E-MBERT) so that it can benefit any new language, and shows that this approach also aids languages that are already in M-BERT.

Poor Man's BERT: Smaller and Faster Transformer Models

A number of memory-light model reduction strategies that do not require model pre-training from scratch are explored, which are able to prune BERT, RoBERTa and XLNet models by up to 40%, while maintaining up to 98% of their original performance.

Beto, Bentz, Becas: The Surprising Cross-Lingual Effectiveness of BERT

This paper explores the broader cross-lingual potential of mBERT (multilingual BERT) as a zero-shot language transfer model on 5 NLP tasks covering a total of 39 languages from various language families: NLI, document classification, NER, POS tagging, and dependency parsing.