WECHSEL: Effective initialization of subword embeddings for cross-lingual transfer of monolingual language models

Benjamin Minixhofer, Fabian Paischer, Navid Rekabsaz
Large pretrained language models (LMs) have become the central building block of many NLP applications. Training these models requires ever more computational resources and most of the existing models are trained on English text only. It is exceedingly expensive to train these models in other languages. To alleviate this problem, we introduce a novel method – called WECHSEL – to efficiently and effectively transfer pretrained LMs to new languages. WECHSEL can be applied to any model which uses… 
History Compression via Language Models in Reinforcement Learning
The new method, HELM, enables actor-critic network architectures that use a pretrained language Transformer as a memory module for history representation, and it is much more sample efficient than competitors.


From English To Foreign Languages: Transferring Pre-trained Language Models
This work tackles the problem of transferring an existing pre-trained model from English to other languages under a limited computational budget and demonstrates that its models are better than multilingual BERT on two zero-shot tasks: natural language inference and dependency parsing.
As Good as New. How to Successfully Recycle English GPT-2 to Make Models for Other Languages
This work describes adapting English GPT-2 to Italian and Dutch by retraining lexical embeddings without tuning the Transformer layers, and finds that generated sentences become more realistic with some additional full model finetuning, especially for Dutch.
Regularization Advantages of Multilingual Neural Language Models for Low Resource Domains
Compared to monolingual LMs, the results show significant improvements of the proposed multilingual LM when the amount of available training data is limited, indicating the advantages of cross-lingual parameter sharing in very low-resource language modeling.
CamemBERT: a Tasty French Language Model
This paper investigates the feasibility of training monolingual Transformer-based language models for other languages, taking French as an example and evaluating their language models on part-of-speech tagging, dependency parsing, named entity recognition and natural language inference tasks.
On the Cross-lingual Transferability of Monolingual Representations
This work designs an alternative approach that transfers a monolingual model to new languages at the lexical level and shows that it is competitive with multilingual BERT on standard cross-lingual classification benchmarks and on a new Cross-lingual Question Answering Dataset (XQuAD).
Massively Multilingual Transfer for NER
Evaluating on named entity recognition, it is shown that the proposed techniques for modulating the transfer are much more effective than strong baselines, including standard ensembling, and the unsupervised method rivals oracle selection of the single best individual model.
How Good is Your Tokenizer? On the Monolingual Performance of Multilingual Language Models
It is found that replacing the original multilingual tokenizer with the specialized monolingual tokenizer improves the downstream performance of the multilingual model for almost every task and language.
Transfer Learning across Low-Resource, Related Languages for Neural Machine Translation
The experiments show that transfer learning helps word-based translation only slightly, but when used on top of a much stronger BPE baseline, it yields larger improvements of up to 4.3 BLEU.
Word Translation Without Parallel Data
It is shown that a bilingual dictionary can be built between two languages without using any parallel corpora, by aligning monolingual word embedding spaces in an unsupervised way.
Unsupervised Cross-lingual Representation Learning at Scale
It is shown that pretraining multilingual language models at scale leads to significant performance gains for a wide range of cross-lingual transfer tasks, and that multilingual modeling is possible without sacrificing per-language performance, a result demonstrated for the first time.