The Bottom-up Evolution of Representations in the Transformer: A Study with Machine Translation and Language Modeling Objectives

Elena Voita, Rico Sennrich, Ivan Titov

We seek to understand how the representations of individual tokens and the structure of the learned feature space evolve between layers in deep neural networks under different learning objectives. We chose the Transformer for our analysis, as it has been shown effective on a variety of tasks, including machine translation (MT), standard left-to-right language modeling (LM) and masked language modeling (MLM). Previous work used black-box probing tasks to show that the representations learned by…

Transformer Feed-Forward Layers Build Predictions by Promoting Concepts in the Vocabulary Space

This work reverse-engineers the operation of the feed-forward network (FFN) layers, one of the building blocks of transformer models, and shows that each update can be decomposed into sub-updates corresponding to single FFN parameter vectors, each promoting concepts that are often human-interpretable.
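The analysis summarized above reads off which vocabulary items a single FFN parameter vector promotes by projecting it through the output embedding matrix. A minimal numpy sketch of that projection step, with toy random weights and hypothetical shapes (`d_model`, `vocab_size` are illustrative, not taken from the paper):

```python
import numpy as np

# Toy sketch: score every vocabulary item against one FFN "value" vector
# (a single sub-update) by projecting it through the output embedding matrix.
rng = np.random.default_rng(0)
d_model, vocab_size, k = 16, 100, 5

E = rng.normal(size=(vocab_size, d_model))  # output (un)embedding matrix
v = rng.normal(size=d_model)                # one FFN value vector

scores = E @ v                              # score per vocabulary item
top_k = np.argsort(scores)[::-1][:k]        # token ids the vector promotes most
print(top_k)
```

In the paper's setting, inspecting the words behind `top_k` is what makes a sub-update "human-interpretable"; here the vocabulary is random, so the ids carry no meaning.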

Analyzing Word Translation of Transformer Layers

This paper proposes approaches to analyzing the translation performed in the encoder and decoder layers of the Transformer, reveals that translation starts at the very beginning of "encoding" (specifically, at the source word embedding layer), and shows how the translation evolves during the forward computation of the layers.

Probing Word Translations in the Transformer and Trading Decoder for Encoder Layers

This work shows that translation already happens progressively in the encoder layers, and even in the input embeddings, and suggests a Transformer configuration change that can increase speed by a factor of up to 2.3 with small gains in translation quality, while a deep 18-4 encoder configuration boosts translation quality by +1.4.

What Language Model Architecture and Pretraining Objective Work Best for Zero-Shot Generalization?

A large-scale evaluation of modeling choices and their impact on the zero-shot generalization of large pretrained Transformer language models focuses on text-to-text models and shows that causal decoder-only models trained with an autoregressive language modeling objective exhibit the strongest zero-shot generalization after purely self-supervised pretraining.

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

This systematic study compares pre-training objectives, architectures, unlabeled datasets, transfer approaches, and other factors on dozens of language understanding tasks and achieves state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more.


This work defines a probing classifier that is used to extract the underlying knowledge graph from nine of the currently most influential language models, including word embeddings, context encoders, and text generators, and shows that different pre-training strategies and architectures lead to different model biases.

Context Analysis for Pre-trained Masked Language Models

A detailed analysis of contextual impact in Transformer- and BiLSTM-based masked language models reveals significant differences in contextual impact between the two model architectures.

Analyzing Transformers in Embedding Space

A theoretical analysis in which all parameters of a trained Transformer are interpreted by projecting them into the embedding space, that is, the space of the vocabulary items they operate on; this opens the door to interpretation methods that, at least in part, abstract away from model specifics and operate in the embedding space only.

Interactively Generating Explanations for Transformer Language Models

This work advocates prototype networks directly incorporated into the model architecture, which explain the reasoning process behind the network's decisions; this offers a better understanding of language models and uses human capabilities to incorporate knowledge beyond the rigid range of purely data-driven approaches.

What Happens To BERT Embeddings During Fine-tuning?

It is found that fine-tuning is a conservative process that primarily affects the top layers of BERT, albeit with noteworthy variation across tasks; SQuAD and MNLI, for instance, involve much shallower processing.



An Analysis of Encoder Representations in Transformer-Based Machine Translation

This work investigates the information that is learned by the attention mechanism in Transformer models with different translation quality, and sheds light on the relative strengths and weaknesses of the various encoder representations.

Evaluating Layers of Representation in Neural Machine Translation on Part-of-Speech and Semantic Tagging Tasks

This paper investigates the quality of vector representations learned at different layers of NMT encoders and finds that higher layers are better at learning semantics while lower layers tend to be better for part-of-speech tagging.
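The layer-wise comparison summarized above rests on a standard probing recipe: freeze the model, treat each layer's activations as fixed features, and fit a simple classifier for a property such as part-of-speech. A minimal numpy sketch of one such probe, using a closed-form least-squares linear classifier rather than any specific setup from the paper (shapes and names are illustrative):

```python
import numpy as np

def probe_accuracy(H, y, n_train):
    """Fit a least-squares linear probe on frozen activations H (n, d)
    with integer labels y, and report held-out accuracy."""
    Y = np.eye(int(y.max()) + 1)[y]                # one-hot targets
    Htr, Hte = H[:n_train], H[n_train:]
    Ytr, yte = Y[:n_train], y[n_train:]
    W, *_ = np.linalg.lstsq(Htr, Ytr, rcond=None)  # closed-form linear probe
    pred = (Hte @ W).argmax(axis=1)
    return float((pred == yte).mean())
```

Running `probe_accuracy` on the activations of each encoder layer in turn, with the same labels, yields the kind of per-layer accuracy curve such studies compare.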

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.

What do Neural Machine Translation Models Learn about Morphology?

This work analyzes the representations learned by neural MT models at various levels of granularity and empirically evaluates the quality of the representations for learning morphology through extrinsic part-of-speech and morphological tagging tasks.

Deep RNNs Encode Soft Hierarchical Syntax

A set of experiments demonstrates that deep recurrent neural networks learn internal representations that capture soft hierarchical notions of syntax from highly varied supervision.

The Lazy Encoder: A Fine-Grained Analysis of the Role of Morphology in Neural Machine Translation

A fine-grained analysis of how various source-side morphological features are captured at different levels of the NMT encoder while varying the target language finds no correlation between the accuracy of source morphology encoding and translation quality.

Attention is All you Need

A new simple network architecture, the Transformer, based solely on attention mechanisms and dispensing with recurrence and convolutions entirely, is proposed; it generalizes well to other tasks, as shown by applying it successfully to English constituency parsing with both large and limited training data.

Language Modeling Teaches You More than Translation Does: Lessons Learned Through Auxiliary Syntactic Task Analysis

This work compares four objectives—language modeling, translation, skip-thought, and autoencoding—on their ability to induce syntactic and part-of-speech information, holding constant the quantity and genre of the training data, as well as the LSTM architecture.

Understanding Learning Dynamics Of Language Models with SVCCA

This first study of the learning dynamics of neural language models uses a simple and flexible analysis method, Singular Vector Canonical Correlation Analysis (SVCCA), which enables comparison of learned representations across time and across models without the need to evaluate directly on annotated data.
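SVCCA itself is a two-step procedure: reduce each set of activations with an SVD, then measure the canonical correlations between the reduced subspaces. A minimal numpy sketch of that pipeline, not the authors' implementation (the `keep` threshold and input shapes are illustrative assumptions):

```python
import numpy as np

def svcca_similarity(X, Y, keep=0.99):
    """SVCCA similarity between two activation matrices X, Y of shape
    (n_samples, n_features): SVD-reduce each, then average the canonical
    correlations between the reduced subspaces."""
    def reduce(A):
        # Keep the top singular directions explaining `keep` of the variance.
        A = A - A.mean(axis=0)
        U, s, _ = np.linalg.svd(A, full_matrices=False)
        r = np.searchsorted(np.cumsum(s**2) / np.sum(s**2), keep) + 1
        return U[:, :r] * s[:r]

    Xr, Yr = reduce(X), reduce(Y)
    # Canonical correlations = singular values of Qx^T Qy for orthonormal
    # bases Qx, Qy of the two reduced subspaces.
    Qx, _ = np.linalg.qr(Xr)
    Qy, _ = np.linalg.qr(Yr)
    rho = np.linalg.svd(Qx.T @ Qy, compute_uv=False)
    return float(rho.mean())
```

Comparing a layer's activations at two training checkpoints with `svcca_similarity` gives the kind of representation-drift curve the study tracks; identical inputs score 1.0, unrelated ones score lower.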

Understanding and Improving Morphological Learning in the Neural Machine Translation Decoder

This paper analyzes how much morphology an NMT decoder learns, investigates whether injecting target morphology into the decoder helps it produce better translations, and presents three methods: simultaneous translation, joint-data learning, and multi-task learning.