Fine-Tuned Transformers Show Clusters of Similar Representations Across Layers

Jason Phang, Haokun Liu, Samuel R. Bowman
Despite the success of fine-tuning pretrained language encoders like BERT for downstream natural language understanding (NLU) tasks, it is still poorly understood how neural networks change after fine-tuning. In this work, we use centered kernel alignment (CKA), a method for comparing learned representations, to measure the similarity of representations in task-tuned models across layers. In experiments across twelve NLU tasks, we discover a consistent block diagonal structure in the similarity… 
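The CKA measure used in the abstract can be illustrated with its linear variant. The sketch below (function name `linear_cka` is illustrative, not from the paper) compares two representation matrices of shape (examples, features), assuming the standard linear-CKA formula from the representation-similarity literature; the paper itself may use a kernelized or minibatch variant.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear centered kernel alignment between two representation
    matrices X (n, d1) and Y (n, d2), rows = the same n examples."""
    # Center each feature (column) so the comparison ignores mean offsets
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    # CKA = ||X^T Y||_F^2 / (||X^T X||_F * ||Y^T Y||_F), in [0, 1]
    num = np.linalg.norm(X.T @ Y, ord='fro') ** 2
    den = np.linalg.norm(X.T @ X, ord='fro') * np.linalg.norm(Y.T @ Y, ord='fro')
    return num / den
```

Computing this score for every pair of layers yields the layer-by-layer similarity matrix in which the block-diagonal structure reported above would appear. The score is invariant to orthogonal transformation and isotropic scaling of either representation, which is what makes it suitable for comparing layers of different models.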


What Happens To BERT Embeddings During Fine-tuning?

It is found that fine-tuning is a conservative process that primarily affects the top layers of BERT, albeit with noteworthy variation across tasks; SQuAD and MNLI, for example, involve much shallower processing.

Undivided Attention: Are Intermediate Layers Necessary for BERT?

This work shows that reducing the number of intermediate layers and modifying the architecture of BERT-BASE results in minimal loss in fine-tuning accuracy on downstream tasks while decreasing the number of parameters and the training time of the model.

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

BERT is a new language representation model designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers; it can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.

Linguistic Knowledge and Transferability of Contextual Representations

It is found that linear models trained on top of frozen contextual representations are competitive with state-of-the-art task-specific models in many cases, but fail on tasks requiring fine-grained linguistic knowledge.

The Bottom-up Evolution of Representations in the Transformer: A Study with Machine Translation and Language Modeling Objectives

This work uses canonical correlation analysis and mutual information estimators to study how information flows across Transformer layers, and observes that the choice of training objective determines this process.

DeBERTa: Decoding-enhanced BERT with Disentangled Attention

A new model architecture DeBERTa (Decoding-enhanced BERT with disentangled attention) is proposed that improves the BERT and RoBERTa models using two novel techniques that significantly improve the efficiency of model pre-training and performance of downstream tasks.

What do you learn from context? Probing for sentence structure in contextualized word representations

A novel edge probing task design is introduced and a broad suite of sub-sentence tasks derived from the traditional structured NLP pipeline are constructed to investigate how sentence structure is encoded across a range of syntactic, semantic, local, and long-range phenomena.

Similarity Analysis of Contextual Word Representation Models

The analysis reveals that models within the same family are more similar to one another, as may be expected, while different architectures have rather similar representations, but different individual neurons.

Transformers: State-of-the-Art Natural Language Processing

Transformers is an open-source library that consists of carefully engineered state-of-the-art Transformer architectures under a unified API, together with a curated collection of pretrained models made by and available for the community.

ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators

The contextual representations learned by the proposed replaced token detection pre-training task substantially outperform the ones learned by methods such as BERT and XLNet given the same model size, data, and compute.