Improved Multilingual Language Model Pretraining for Social Media Text via Translation Pair Prediction

Shubhanshu Mishra and Aria Haghighi
We evaluate a simple approach to improving zero-shot multilingual transfer of mBERT on a social media corpus by adding a pretraining task called translation pair prediction (TPP), which predicts whether a pair of cross-lingual texts is a valid translation. Our approach assumes access to translations (exact or approximate) between source-target language pairs, where we fine-tune a model on source language task data and evaluate the model in the target language. In particular, we focus on language…
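As a rough illustration of the TPP objective described above, the sketch below builds binary classification examples from a list of translation pairs: each true pair is labeled 1, and each source text is also paired with a mismatched target labeled 0. The function name `build_tpp_examples`, the random negative-sampling strategy, and the toy sentence pairs are assumptions made for illustration; the truncated abstract does not say how the authors construct negatives.

```python
import random
from typing import List, Tuple

Pair = Tuple[str, str]
Example = Tuple[str, str, int]


def build_tpp_examples(
    pairs: List[Pair], num_negatives: int = 1, seed: int = 0
) -> List[Example]:
    """Turn (source, target) translation pairs into labeled TPP examples.

    Label 1 marks a valid translation pair, label 0 a mismatched pair.
    Negatives are drawn at random from the other targets (an assumption,
    not necessarily the sampling scheme used in the paper).
    """
    rng = random.Random(seed)
    targets = [tgt for _, tgt in pairs]
    examples: List[Example] = []
    for src, tgt in pairs:
        # Positive example: the known translation.
        examples.append((src, tgt, 1))
        # Candidate negatives: any target whose text differs from the true one.
        candidates = [t for t in targets if t != tgt]
        k = min(num_negatives, len(candidates))
        for neg in rng.sample(candidates, k):
            examples.append((src, neg, 0))
    return examples


if __name__ == "__main__":
    # Hypothetical toy data, used only to show the call shape.
    toy_pairs = [
        ("hello", "hola"),
        ("thank you", "gracias"),
        ("good night", "buenas noches"),
    ]
    for example in build_tpp_examples(toy_pairs):
        print(example)
```

In a pretraining setup, each `(source, target, label)` triple would typically be fed to the encoder as a sentence pair with a binary classification head; that wiring is omitted here.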

Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond
An architecture to learn joint multilingual sentence representations for 93 languages, belonging to more than 30 different families and written in 28 different scripts using a single BiLSTM encoder with a shared byte-pair encoding vocabulary for all languages, coupled with an auxiliary decoder and trained on publicly available parallel corpora.
Cross-lingual Language Model Pretraining
This work proposes two methods to learn cross-lingual language models (XLMs): one unsupervised that only relies on monolingual data, and one supervised that leverages parallel data with a new cross-lingual language model objective.
Unicoder: A Universal Language Encoder by Pre-training with Multiple Cross-lingual Tasks
It is found that doing fine-tuning on multiple languages together can bring further improvement in Unicoder, a universal language encoder that is insensitive to different languages.
Effective Parallel Corpus Mining using Bilingual Sentence Embeddings
The embedding models are trained to produce similar representations exclusively for bilingual sentence pairs that are translations of each other using a novel training method that introduces hard negatives consisting of sentences that are not translations but have some degree of semantic similarity.
WikiMatrix: Mining 135M Parallel Sentences in 1620 Language Pairs from Wikipedia
We present an approach based on multilingual sentence embeddings to automatically extract parallel sentences from the content of Wikipedia articles in 96 languages, including several dialects or
Multilingual Alignment of Contextual Word Representations
After the proposed alignment procedure, BERT exhibits significantly improved zero-shot performance on XNLI compared to the base model, remarkably matching pseudo-fully-supervised translate-train models for Bulgarian and Greek.
Exploring multi-task multi-lingual learning of transformer models for hate speech and offensive speech identification in social media
A multi-task and multilingual approach based on recently proposed Transformer neural networks to solve three sub-tasks for hate speech, showing that different combined approaches can yield models that generalize easily across languages and tasks, while trading off slight accuracy for a much reduced inference-time compute cost.
How Language-Neutral is Multilingual BERT?
This work shows that mBERT representations can be split into a language-specific component and a language-neutral component, and that the language-neutral component is sufficiently general in terms of modeling semantics to allow high-accuracy word-alignment and sentence retrieval but is not yet good enough for the more difficult task of MT quality estimation.
How Multilingual is Multilingual BERT?
It is concluded that M-BERT does create multilingual representations, but that these representations exhibit systematic deficiencies affecting certain language pairs, and that the model can find translation pairs.
Don’t Stop Pretraining: Adapt Language Models to Domains and Tasks
It is consistently found that multi-phase adaptive pretraining offers large gains in task performance, and it is shown that adapting to a task corpus augmented using simple data selection strategies is an effective alternative, especially when resources for domain-adaptive pretraining might be unavailable.