The Low-Dimensional Linear Geometry of Contextualized Word Representations

Evan Hernandez and Jacob Andreas
Black-box probing models can reliably extract linguistic features like tense, number, and syntactic role from pretrained word representations. However, the manner in which these features are encoded in representations remains poorly understood. We present a systematic study of the linear geometry of contextualized word representations in ELMo and BERT. We show that a variety of linguistic features (including structured dependency relationships) are encoded in low-dimensional subspaces. We then…
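
To make the idea of a feature living in a low-dimensional subspace concrete, here is a minimal sketch on synthetic data. It is not the authors' procedure, which the truncated abstract above does not spell out. It assumes a least-squares linear probe whose weight matrix is truncated to rank k with an SVD. The function names (`make_synthetic`, `fit_linear_probe`, `truncate_rank`, `accuracy`) and the toy data, in which a 3-way label depends only on the first two of 32 coordinates, are hypothetical.

```python
import numpy as np


def make_synthetic(n=3000, d=32, seed=0):
    """Hypothetical toy data: a 3-way label that depends only on the
    angle of each vector within its first two coordinates."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, d))
    angle = np.arctan2(X[:, 1], X[:, 0])  # in [-pi, pi]
    y = np.floor((angle + np.pi) / (2 * np.pi / 3)).astype(int) % 3
    return X, y


def fit_linear_probe(X, y, num_classes):
    """Least-squares regression onto one-hot labels.
    Returns a weight matrix of shape (d, num_classes)."""
    Y = np.eye(num_classes)[y]
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return W


def truncate_rank(W, k):
    """Best rank-k approximation of W (Eckart-Young), via SVD."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k]


def accuracy(X, y, W):
    """Fraction of rows whose highest-scoring class matches the label."""
    return float(np.mean(np.argmax(X @ W, axis=1) == y))


if __name__ == "__main__":
    X, y = make_synthetic()
    W = fit_linear_probe(X, y, num_classes=3)
    for k in (1, 2, 3):
        print(k, accuracy(X, y, truncate_rank(W, k)))
```

In this toy setting the rank-2 probe should do about as well as the unrestricted one, while the rank-1 probe should do clearly worse, since the label depends on a two-dimensional subspace. Accuracy is measured on the training data here to keep the sketch short; a real probing experiment would use held-out data.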

Putting Words in BERT's Mouth: Navigating Contextualized Vector Spaces with Pseudowords
Using a contextualized “pseudoword” as a stand-in for a static embedding in the input layer and then performing masked prediction of a word in the sentence, this work is able to investigate the geometry of the BERT-space in a controlled manner around individual instances.
Can Language Models Encode Perceptual Structure Without Grounding? A Case Study in Color
A thorough case study on color finds that warmer colors are, on average, better aligned to the perceptual color space than cooler ones, suggesting an intriguing connection to findings from recent work on efficient communication in color naming.


Visualizing and Measuring the Geometry of BERT
This paper describes qualitative and quantitative investigations of one particularly effective model, BERT, and finds evidence of a fine-grained geometric representation of word senses in both attention matrices and individual word embeddings.
How Contextual are Contextualized Word Representations? Comparing the Geometry of BERT, ELMo, and GPT-2 Embeddings
It is found that in all layers of ELMo, BERT, and GPT-2, on average, less than 5% of the variance in a word’s contextualized representations can be explained by a static embedding for that word, providing some justification for the success of contextualized representations.
Deep Contextualized Word Representations
A new type of deep contextualized word representation is introduced that models both complex characteristics of word use and how these uses vary across linguistic contexts, allowing downstream models to mix different types of semi-supervision signals.
A Structural Probe for Finding Syntax in Word Representations
A structural probe is proposed, which evaluates whether syntax trees are embedded in a linear transformation of a neural network’s word representation space, and shows that such transformations exist for both ELMo and BERT but not in baselines, providing evidence that entire syntax trees are embedded implicitly in deep models’ vector geometry.
Linguistic Knowledge and Transferability of Contextual Representations
It is found that linear models trained on top of frozen contextual representations are competitive with state-of-the-art task-specific models in many cases, but fail on tasks requiring fine-grained linguistic knowledge.
What do you learn from context? Probing for sentence structure in contextualized word representations
A novel edge probing task design is introduced and a broad suite of sub-sentence tasks derived from the traditional structured NLP pipeline are constructed to investigate how sentence structure is encoded across a range of syntactic, semantic, local, and long-range phenomena.
Understanding Learning Dynamics Of Language Models with SVCCA
A first study of the learning dynamics of neural language models is presented, using a simple and flexible analysis method called Singular Vector Canonical Correlation Analysis (SVCCA), which makes it possible to compare learned representations across time and across models without evaluating directly on annotated data.
Language Modeling Teaches You More than Translation Does: Lessons Learned Through Auxiliary Syntactic Task Analysis
This work compares four objectives—language modeling, translation, skip-thought, and autoencoding—on their ability to induce syntactic and part-of-speech information, holding constant the quantity and genre of the training data, as well as the LSTM architecture.
Targeted Syntactic Evaluation of Language Models
In an experiment using this data set, an LSTM language model performed poorly on many of the constructions, and a large gap remained between its performance and the accuracy of human participants recruited online.
What Is One Grain of Sand in the Desert? Analyzing Individual Neurons in Deep NLP Models
This work presents a comprehensive analysis of neurons and proposes two methods: Linguistic Correlation Analysis, a supervised method to extract the most relevant neurons with respect to an extrinsic task, and Cross-model Correlation Analysis, an unsupervised method to extract salient neurons with respect to the model itself.