How Good is Your Tokenizer? On the Monolingual Performance of Multilingual Language Models

@inproceedings{Rust2021HowGI,
  title={How Good is Your Tokenizer? On the Monolingual Performance of Multilingual Language Models},
  author={Phillip Rust and Jonas Pfeiffer and Ivan Vulic and Sebastian Ruder and Iryna Gurevych},
  booktitle={ACL},
  year={2021}
}
In this work, we provide a systematic and comprehensive empirical comparison of pretrained multilingual language models versus their monolingual counterparts with regard to their monolingual task performance. We study a set of nine typologically diverse languages with readily available pretrained monolingual models on a set of five diverse monolingual downstream tasks. We first aim to establish, via fair and controlled comparisons, if a gap between the multilingual and the corresponding… 
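As a rough illustration of the kind of tokenizer comparison the paper studies, the short Python sketch below computes subword fertility (the average number of subword tokens per whitespace-separated word) for a multilingual versus a monolingual tokenizer. It assumes the Hugging Face transformers library; the Finnish example sentence and the TurkuNLP/bert-base-finnish-cased-v1 checkpoint are illustrative choices, not taken from this page.

# Hedged sketch: compare the subword fertility of a multilingual and a
# monolingual tokenizer. Model identifiers are assumptions, not from the paper.
from transformers import AutoTokenizer

def fertility(tokenizer, sentences):
    """Average number of subword tokens per whitespace-separated word."""
    n_words, n_subwords = 0, 0
    for sentence in sentences:
        for word in sentence.split():
            n_words += 1
            n_subwords += len(tokenizer.tokenize(word))
    return n_subwords / max(n_words, 1)

sentences = ["Esimerkiksi suomenkielinen lause pilkotaan usein moneen osaan."]

multilingual = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
monolingual = AutoTokenizer.from_pretrained("TurkuNLP/bert-base-finnish-cased-v1")

print("mBERT fertility:  ", fertility(multilingual, sentences))
print("FinBERT fertility:", fertility(monolingual, sentences))

Lower fertility means the tokenizer splits words less aggressively, one of the tokenizer-quality signals the paper relates to downstream monolingual performance.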
Multi Task Learning For Zero Shot Performance Prediction of Multilingual Models
TLDR
This work builds upon existing techniques for predicting the zero-shot performance of multilingual models on a task by modeling it as a multi-task learning problem and jointly training predictive models for different tasks.
Pre-Trained Transformer-Based Language Models for Sundanese
TLDR
Three monolingual Transformer-based language models are pre-trained on Sundanese data that outperformed larger multilingual models despite the smaller overall pre-training data.
Wine is Not v i n. - On the Compatibility of Tokenizations Across Languages
TLDR
It is shown that the compatibility measure proposed allows the system designer to create vocabularies across languages that are compatible – a desideratum that so far has been neglected in multilingual models.
XLM-T: Multilingual Language Models in Twitter for Sentiment Analysis and Beyond
TLDR
A new strong multilingual baseline is provided, consisting of an XLM-R (Conneau et al., 2020) model pre-trained on millions of tweets in over thirty languages, alongside starter code for subsequently tuning it on a target task.
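As a hedged usage sketch (not code from the cited paper), a tweet-pretrained XLM-R sentiment model can be loaded through the standard transformers pipeline API; the model identifier below is an assumption about the released checkpoint and should be swapped for whatever the authors actually publish.

# Hedged sketch: multilingual tweet sentiment via a Twitter-pretrained XLM-R
# model. The model identifier is an assumption, not taken from this page.
from transformers import pipeline

classifier = pipeline(
    "sentiment-analysis",
    model="cardiffnlp/twitter-xlm-roberta-base-sentiment",
)

print(classifier(["¡Qué buen partido!", "This service is awful."]))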
Crossing the Conversational Chasm: A Primer on Multilingual Task-Oriented Dialogue Systems
TLDR
This work identifies two main challenges that, combined, hinder faster progress in multilingual TOD: current state-of-the-art TOD models based on large pretrained neural language models are data-hungry, while data acquisition for TOD use cases is expensive and tedious.
MDAPT: Multilingual Domain Adaptive Pretraining in a Single Model
TLDR
Evaluation on nine domain-specific datasets shows that a single multilingual domain-specific model can outperform the general multilingual model and performs close to its monolingual counterpart.
XLM-T: A Multilingual Language Model Toolkit for Twitter
TLDR
This paper introduces XLM-T, a modular framework for using and evaluating multilingual language models on Twitter that can easily be extended to additional tasks and integrated with recent efforts aimed at the homogenization of Twitter-specific datasets.
MAD-G: Multilingual Adapter Generation for Efficient Cross-Lingual Transfer
TLDR
MAD-G (Multilingual ADapter Generation), which contextually generates language adapters from language representations based on typological features, offers substantial benefits for low-resource languages, particularly on the NER task in low-resource African languages.
Beyond Static models and test sets: Benchmarking the potential of pre-trained models across tasks and languages
Although recent Massively Multilingual Language Models (MMLMs) like mBERT and XLMR support around 100 languages, most existing multilingual NLP benchmarks provide evaluation data in only a handful of languages…

References

Showing 1-10 of 102 references
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
TLDR
A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
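The "one additional output layer" fine-tuning recipe can be sketched as follows; this is a generic transformers-based illustration rather than code from the paper, and the toy texts, labels, and learning rate are placeholders.

# Hedged sketch: fine-tuning BERT for sequence classification by adding a
# single, randomly initialised classification head on top of the encoder.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-cased", num_labels=2  # the extra output layer
)

texts = ["a great movie", "a terrible movie"]
labels = torch.tensor([1, 0])  # placeholder labels
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
outputs = model(**batch, labels=labels)  # cross-entropy loss over the head
outputs.loss.backward()
optimizer.step()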
AraBERT: Transformer-based Model for Arabic Language Understanding
TLDR
This paper pre-trains BERT specifically for the Arabic language, in pursuit of the same success that BERT achieved for English, and shows that the newly developed AraBERT achieves state-of-the-art performance on most tested Arabic NLP tasks.
IndoNLU: Benchmark and Resources for Evaluating Indonesian Natural Language Understanding
TLDR
The first-ever vast resource for training, evaluating, and benchmarking Indonesian natural language understanding (IndoNLU) tasks is introduced, releasing baseline models for all twelve tasks as well as the framework for benchmark evaluation, enabling everyone to benchmark their systems' performance.
KR-BERT: A Small-Scale Korean-Specific Language Model
TLDR
This paper trains a Korean-specific model, KR-BERT, using a smaller vocabulary and dataset and adjusting the minimal token span for tokenization, from the sub-character level to the character level, to construct a better vocabulary for the model.
Unsupervised Cross-lingual Representation Learning at Scale
TLDR
It is shown that pretraining multilingual language models at scale leads to significant performance gains for a wide range of cross-lingual transfer tasks, and the possibility of multilingual modeling without sacrificing per-language performance is shown for the first time.
Adaptation of Deep Bidirectional Multilingual Transformers for Russian Language
TLDR
This work shows that transfer learning from a multilingual model to a monolingual model results in significant performance gains on tasks such as reading comprehension, paraphrase detection, and sentiment analysis.
Is Multilingual BERT Fluent in Language Generation?
TLDR
It is found that the English and German models perform well at generation, whereas the multilingual model is lacking, in particular, for Nordic languages.
Multilingual is not enough: BERT for Finnish
TLDR
While the multilingual model largely fails to reach the performance of previously proposed methods, the custom Finnish BERT model establishes new state-of-the-art results on all corpora for all reference tasks: part-of-speech tagging, named entity recognition, and dependency parsing.
RoBERTa: A Robustly Optimized BERT Pretraining Approach
TLDR
It is found that BERT was significantly undertrained and, when pretrained carefully, can match or exceed the performance of every model published after it; the best model achieves state-of-the-art results on GLUE, RACE, and SQuAD.
Hotel Arabic-Reviews Dataset Construction for Sentiment Analysis Applications
TLDR
This paper introduces HARD (Hotel Arabic-Reviews Dataset), a large dataset of Arabic hotel reviews for subjective sentiment analysis and machine learning applications, and implements a polarity lexicon-based sentiment analyzer.