From Zero to Hero: On the Limitations of Zero-Shot Language Transfer with Multilingual Transformers

@inproceedings{Lauscher2020FromZT,
  title={From Zero to Hero: On the Limitations of Zero-Shot Language Transfer with Multilingual Transformers},
  author={Anne Lauscher and Vinit Ravishankar and Ivan Vuli{\'c} and Goran Glava{\v s}},
  booktitle={EMNLP},
  year={2020}
}
Massively multilingual transformers (MMTs) pretrained via language modeling (e.g., mBERT, XLM-R) have become a default paradigm for zero-shot language transfer in NLP, offering unmatched transfer performance. Current evaluations, however, verify their efficacy in transfers (a) to languages with sufficiently large pretraining corpora, and (b) between close languages. In this work, we analyze the limitations of downstream language transfer with MMTs, showing that, much like cross-lingual word…
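To make the setting concrete, the sketch below shows the standard zero-shot transfer recipe the paper analyzes: fine-tune a massively multilingual encoder (here XLM-R via the Hugging Face transformers library) on labelled source-language (English) data only, then run inference directly on the target language. The NLI-style inputs, hyperparameters, and the helper functions train_step and predict are illustrative assumptions, not the paper's exact experimental setup.

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "xlm-roberta-base"  # any MMT works here (mBERT, XLM-R, ...)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

def train_step(premises, hypotheses, labels):
    # One fine-tuning step on English (source-language) NLI examples.
    batch = tokenizer(premises, hypotheses, padding=True, truncation=True,
                      return_tensors="pt")
    out = model(**batch, labels=torch.tensor(labels))
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return out.loss.item()

@torch.no_grad()
def predict(premises, hypotheses):
    # Zero-shot inference on target-language examples: no target-language
    # labelled data is ever seen during fine-tuning.
    batch = tokenizer(premises, hypotheses, padding=True, truncation=True,
                      return_tensors="pt")
    return model(**batch).logits.argmax(dim=-1)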
Revisiting the Primacy of English in Zero-shot Cross-lingual Transfer
English is compared against other transfer languages for fine-tuning; other high-resource languages such as German and Russian often transfer more effectively, especially when the set of target languages is diverse or unknown a priori.
AmericasNLI: Evaluating Zero-shot Natural Language Understanding of Pretrained Multilingual Models in Truly Low-resource Languages
Pretrained multilingual models are able to perform cross-lingual transfer in a zero-shot setting, even for languages unseen during pretraining. However, prior work evaluating performance on unseen …
Orthogonal Language and Task Adapters in Zero-Shot Cross-Lingual Transfer
This work proposes orthogonal language and task adapters (dubbed orthoadapters) for cross-lingual transfer, trained to encode language- and task-specific information complementary to the knowledge already stored in the pretrained transformer's parameters.
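As a rough illustration of the adapter idea (not the exact orthoadapter configuration from that paper), the snippet below sketches a bottleneck adapter module in plain PyTorch: a small down-project/up-project layer with a residual connection, trained while the pretrained transformer's own weights stay frozen. The hidden and bottleneck sizes are assumptions.

import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    # Small trainable module inserted into each (frozen) transformer layer;
    # in the setup above, one such adapter per language and one per task.
    def __init__(self, hidden_size: int = 768, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)
        self.up = nn.Linear(bottleneck, hidden_size)
        self.act = nn.GELU()

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Residual connection: the adapter only adds a small correction to
        # the pretrained representation.
        return hidden_states + self.up(self.act(self.down(hidden_states)))

adapter = BottleneckAdapter()
x = torch.randn(2, 16, 768)           # (batch, sequence length, hidden size)
assert adapter(x).shape == x.shape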
Translate & Fill: Improving Zero-Shot Multilingual Semantic Parsing with Synthetic Data
Experimental results on three multilingual semantic parsing datasets show that data augmentation with TaF reaches accuracies competitive with similar systems that rely on traditional alignment techniques.
How to Adapt Your Pretrained Multilingual Model to 1600 Languages
This paper evaluates existing methods for adapting pretrained multilingual models (PMMs) to new languages using a resource available for over 1600 languages, the New Testament, and finds that continued pretraining, the simplest approach, performs best.
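Continued masked-language-model pretraining, the adaptation strategy that paper finds to work best, can be sketched with the Hugging Face transformers library as below; the model choice, hyperparameters, and the placeholder corpus are illustrative assumptions.

import torch
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling)

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-multilingual-cased")
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# Stand-in for the target-language corpus (e.g., New Testament verses).
target_language_sentences = ["replace this with real target-language text ..."]

encodings = [tokenizer(s, truncation=True, max_length=128)
             for s in target_language_sentences]
batch = collator(encodings)   # pads, randomly masks tokens, builds MLM labels
loss = model(**batch).loss    # standard MLM objective, now on new-language text
loss.backward()
optimizer.step()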
Continual Mixed-Language Pre-Training for Extremely Low-Resource Neural Machine Translation
A continual pre-training (CPT) framework on mBART is proposed to effectively adapt it to unseen languages; it consistently improves fine-tuning performance over the mBART baseline, as well as other strong baselines, across all tested low-resource translation pairs containing unseen languages.
Language Models are Few-shot Multilingual Learners
It is shown that, given a few English examples as context, pre-trained language models can predict not only English test samples but also non-English ones, and that they are competitive with existing state-of-the-art cross-lingual and translation models.
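A minimal sketch of that few-shot setup: English labelled examples go into the context, followed by a non-English test input, and a pretrained causal language model is asked to continue the prompt. The task, prompt format, and the generate_fn placeholder are illustrative assumptions rather than that paper's exact protocol.

# English in-context examples for a sentiment task (illustrative).
english_examples = [
    ("The movie was fantastic.", "positive"),
    ("I would not recommend this product.", "negative"),
]
target_sample = "Das Essen war ausgezeichnet."   # non-English test input (German)

prompt = ""
for text, label in english_examples:
    prompt += f"Review: {text}\nSentiment: {label}\n\n"
prompt += f"Review: {target_sample}\nSentiment:"
print(prompt)

# `generate_fn` is a placeholder for any pretrained causal LM's generation
# call; the model labels the non-English input by completing the prompt.
# prediction = generate_fn(prompt, max_new_tokens=1)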
Modelling Latent Translations for Cross-Lingual Transfer
A new technique is proposed that integrates both steps of the traditional pipeline (translation and classification) into a single model by treating the intermediate translations as a latent random variable; the model can be fine-tuned with a variant of Minimum Risk Training.
MergeDistill: Merging Pre-trained Language Models using Distillation
MERGEDISTILL is proposed, a framework that merges pre-trained LMs using task-agnostic knowledge distillation so as to best leverage their assets with minimal dependencies; the applicability of the framework in a practical setting is also demonstrated.
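The task-agnostic distillation step at the core of such merging can be reduced to matching a student's output distribution to a teacher's. Below is a minimal sketch of a temperature-scaled soft-label loss in PyTorch; the temperature, shapes, and random logits are illustrative, not the framework's actual configuration.

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # KL divergence between temperature-softened teacher and student outputs.
    t = temperature
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    # Scale by t**2 so gradient magnitudes stay comparable across temperatures.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * t * t

teacher_logits = torch.randn(8, 30000)                       # teacher scores
student_logits = torch.randn(8, 30000, requires_grad=True)   # student scores
loss = distillation_loss(student_logits, teacher_logits)
loss.backward()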
Limitations of Knowledge Distillation for Zero-shot Transfer Learning
Pretrained transformer-based encoders such as BERT have been demonstrated to achieve state-of-the-art performance on numerous NLP tasks. Despite their success, BERT-style encoders are large in size …

References

Showing 1-10 of 66 references.
Beto, Bentz, Becas: The Surprising Cross-Lingual Effectiveness of BERT
This paper explores the broader cross-lingual potential of mBERT (multilingual BERT) as a zero-shot language transfer model on 5 NLP tasks covering a total of 39 languages from various language families: NLI, document classification, NER, POS tagging, and dependency parsing.
Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond
An architecture to learn joint multilingual sentence representations for 93 languages, belonging to more than 30 different families and written in 28 different scripts, using a single BiLSTM encoder with a shared byte-pair encoding vocabulary for all languages, coupled with an auxiliary decoder and trained on publicly available parallel corpora.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers; it can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
Massively Multilingual Transfer for NER
Evaluating on named entity recognition, it is shown that the proposed techniques for modulating the transfer are much more effective than strong baselines, including standard ensembling, and that the unsupervised method rivals oracle selection of the single best individual model.
XNLI: Evaluating Cross-lingual Sentence Representations
This work constructs an evaluation set for XLU by extending the development and test sets of the Multi-Genre Natural Language Inference Corpus to 14 languages, including low-resource languages such as Swahili and Urdu, and finds that XNLI represents a practical and challenging evaluation suite and that directly translating the test data yields the best performance among available baselines.
Unsupervised Cross-lingual Representation Learning at Scale
It is shown that pretraining multilingual language models at scale leads to significant performance gains for a wide range of cross-lingual transfer tasks, and the possibility of multilingual modeling without sacrificing per-language performance is shown for the first time.
How Multilingual is Multilingual BERT?
It is concluded that M-BERT does create multilingual representations, but that these representations exhibit systematic deficiencies affecting certain language pairs, and that the model can find translation pairs.
75 Languages, 1 Model: Parsing Universal Dependencies Universally
It is found that fine-tuning a multilingual BERT self-attention model pretrained on 104 languages can meet or exceed state-of-the-art UPOS, UFeats, Lemmas, (and especially) UAS and LAS scores, without requiring any recurrent or language-specific components.
Cheap Translation for Cross-Lingual Named Entity Recognition
A simple method for cross-lingual named entity recognition (NER) that works well in settings with very minimal resources: it makes use of a lexicon to “translate” annotated data available in one or several high-resource languages into the target language and learns a standard monolingual NER model there.
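A toy version of that lexicon-based “cheap translation” step, with a made-up English-to-German lexicon: each source token is replaced by its lexicon entry (or kept as-is) and the aligned NER tag is copied over, yielding silver target-language training data for a standard monolingual tagger. Lexicon, tokens, and tags below are illustrative.

# Made-up English-to-German lexicon; real systems use much larger lexicons.
bilingual_lexicon = {"the": "der", "president": "Präsident",
                     "visited": "besuchte", "Berlin": "Berlin"}

def cheap_translate(tokens, tags, lexicon):
    # Word-by-word replacement; NER tags are carried over unchanged.
    translated = [lexicon.get(tok, tok) for tok in tokens]
    return list(zip(translated, tags))

source_tokens = ["the", "president", "visited", "Berlin"]
source_tags = ["O", "O", "O", "B-LOC"]
print(cheap_translate(source_tokens, source_tags, bilingual_lexicon))
# [('der', 'O'), ('Präsident', 'O'), ('besuchte', 'O'), ('Berlin', 'B-LOC')]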
Emerging Cross-lingual Structure in Pretrained Language Models
It is shown that transfer is possible even when there is no shared vocabulary across the monolingual corpora and also when the text comes from very different domains, and it is strongly suggested that, much like for non-contextual word embeddings, there are universal latent symmetries in the learned embedding spaces.