Identifying the Limits of Cross-Domain Knowledge Transfer for Pretrained Models

Zhengxuan Wu, Nelson F. Liu, Christopher Potts
There is growing evidence that pretrained language models improve task-specific fine-tuning even where the task examples are radically different from those seen in training. We study an extreme case of transfer learning by providing a systematic exploration of how much transfer occurs when models are denied any information about word identity via random scrambling. In four classification tasks and two sequence labeling tasks, we evaluate LSTMs using GloVe embeddings, BERT, and baseline models… 
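The word-identity scrambling described above can be illustrated with a minimal sketch. This is a hypothetical reconstruction, not the paper's actual preprocessing code: it assumes scrambling is done by applying one fixed random permutation to the vocabulary, so every occurrence of a word type maps to the same other word type, preserving distributional structure while destroying word identity.

```python
import random

def scramble_vocab(sentences, seed=0):
    """Map each word type to a random other word type via a fixed
    permutation of the vocabulary (hypothetical illustration of the
    scrambling setup, not the authors' exact procedure)."""
    vocab = sorted({w for s in sentences for w in s})
    rng = random.Random(seed)
    shuffled = vocab[:]
    rng.shuffle(shuffled)
    mapping = dict(zip(vocab, shuffled))  # consistent type-level mapping
    return [[mapping[w] for w in s] for s in sentences]

sents = [["the", "cat", "sat"], ["the", "dog", "ran"]]
scrambled = scramble_vocab(sents)
```

Because the mapping is applied at the type level, co-occurrence statistics among the scrambled tokens mirror those of the originals, which is what makes any remaining transfer attributable to structure rather than lexical identity.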


Oolong: Investigating What Makes Crosslingual Transfer Hard with Controlled Studies
It is found that by far the most impactful factor for crosslingual transfer is the challenge of aligning the new embeddings with the existing transformer layers (18% drop), with little additional effect from switching tokenizers or word morphologies.


Linguistic Knowledge and Transferability of Contextual Representations
It is found that linear models trained on top of frozen contextual representations are competitive with state-of-the-art task-specific models in many cases, but fail on tasks requiring fine-grained linguistic knowledge.
GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding
A benchmark of nine diverse NLU tasks, an auxiliary dataset for probing models for understanding of specific linguistic phenomena, and an online platform for evaluating and comparing models, which favors models that can represent linguistic knowledge in a way that facilitates sample-efficient learning and effective knowledge-transfer across tasks.
Investigating Transferability in Pretrained Language Models
This technique reveals that in BERT, layers with high probing performance on downstream GLUE tasks are neither necessary nor sufficient for high accuracy on those tasks, and the benefit of using pretrained parameters for a layer varies dramatically with dataset size.
Masked Language Modeling and the Distributional Hypothesis: Order Word Matters Pre-training for Little
This paper pre-trains MLMs on sentences with randomly shuffled word order and shows that these models still achieve high accuracy after fine-tuning on many downstream tasks, including tasks specifically designed to be challenging for models that ignore word order.
Language Models as Knowledge Bases?
An in-depth analysis of the relational knowledge already present (without fine-tuning) in a wide range of state-of-the-art pretrained language models finds that BERT contains relational knowledge competitive with traditional NLP methods that have some access to oracle knowledge.
On the Cross-lingual Transferability of Monolingual Representations
This work designs an alternative approach that transfers a monolingual model to new languages at the lexical level and shows that it is competitive with multilingual BERT on standard cross-lingual classification benchmarks and on a new Cross-lingual Question Answering Dataset (XQuAD).
Scaling Laws for Transfer
The effective data "transferred" from pre-training is calculated by determining how much data a transformer of the same size would have required to achieve the same loss when training from scratch; this transferred quantity is well described by a power law in parameter count and fine-tuning dataset size.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
DeBERTa: Decoding-enhanced BERT with Disentangled Attention
A new model architecture DeBERTa (Decoding-enhanced BERT with disentangled attention) is proposed that improves the BERT and RoBERTa models using two novel techniques that significantly improve the efficiency of model pre-training and performance of downstream tasks.
ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators
The contextual representations learned by the proposed replaced token detection pre-training task substantially outperform the ones learned by methods such as BERT and XLNet given the same model size, data, and compute.