Corpus ID: 237416769

Establishing Interlingua in Multilingual Language Models

  title={Establishing Interlingua in Multilingual Language Models},
  author={Maksym Del and Mark Fishel},
Large multilingual language models show remarkable zero-shot cross-lingual transfer performance on a range of tasks. Follow-up works hypothesized that these models internally project representations of different languages into a shared interlingual space. However, they produced contradictory results. In this paper, we correct the famous prior work claiming that “BERT is not an Interlingua” and show that with the proper choice of sentence representation different languages actually do converge… 


Emerging Cross-lingual Structure in Pretrained Language Models
It is shown that transfer is possible even when there is no shared vocabulary across the monolingual corpora and also when the text comes from very different domains, and it is strongly suggested that, much like for non-contextual word embeddings, there are universal latent symmetries in the learned embedding spaces.
BERT is Not an Interlingua and the Bias of Tokenization
Cananical Correlation Analysis of the internal representations of a pre- trained, multilingual BERT model reveals that the model partitions representations for each language rather than using a common, shared, interlingual space.
First Align, then Predict: Understanding the Cross-Lingual Ability of Multilingual BERT
It is shown that multilingual BERT, a popular multilingual language model, can be viewed as the stacking of two sub-networks: a multilingual encoder followed by a task-specific language-agnostic predictor.
Unsupervised Cross-lingual Representation Learning at Scale
It is shown that pretraining multilingual language models at scale leads to significant performance gains for a wide range of cross-lingual transfer tasks, and the possibility of multilingual modeling without sacrificing per-language performance is shown for the first time.
Cross-lingual Language Model Pretraining
This work proposes two methods to learn cross-lingual language models (XLMs): one unsupervised that only relies on monolingual data, and one supervised that leverages parallel data with a new cross-lingsual language model objective.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter
This work proposes a method to pre-train a smaller general-purpose language representation model, called DistilBERT, which can be fine-tuned with good performances on a wide range of tasks like its larger counterparts, and introduces a triple loss combining language modeling, distillation and cosine-distance losses.
Insights on representational similarity in neural networks with canonical correlation
Comparing different neural network representations and determining how representations evolve over time remain challenging open questions in our understanding of the function of neural networks.
RoBERTa: A Robustly Optimized BERT Pretraining Approach
It is found that BERT was significantly undertrained, and can match or exceed the performance of every model published after it, and the best model achieves state-of-the-art results on GLUE, RACE and SQuAD.
Similarity of Neural Network Representations Revisited
A similarity index is introduced that measures the relationship between representational similarity matrices and does not suffer from this limitation of CCA.