• Corpus ID: 108300988

What do you learn from context? Probing for sentence structure in contextualized word representations

Ian Tenney, Patrick Xia, Berlin Chen, Alex Wang, Adam Poliak, R. Thomas McCoy, Najoung Kim, Benjamin Van Durme, Samuel R. Bowman, Dipanjan Das, Ellie Pavlick

Contextualized representation models such as ELMo (Peters et al., 2018a) and BERT (Devlin et al., 2018) have recently achieved state-of-the-art results on a diverse array of downstream NLP tasks. […] We probe word-level contextual representations from four recent models and investigate how they encode sentence structure across a range of syntactic, semantic, local, and long-range phenomena. We find that existing models trained on language modeling and translation produce strong representations for…

Quantifying the Contextualization of Word Representations with Semantic Class Probing

This work quantifies the amount of contextualization, i.e., how well words are interpreted in context, by studying the extent to which semantic classes of a word can be inferred from its contextualized embedding.

Context Analysis for Pre-trained Masked Language Models

A detailed analysis of contextual impact in Transformer- and BiLSTM-based masked language models suggests significant differences in contextual impact between the two model architectures.

What does it mean to be language-agnostic? Probing multilingual sentence encoders for typological properties

This work proposes methods for probing sentence representations from state-of-the-art multilingual encoders with respect to a range of typological properties pertaining to lexical, morphological and syntactic structure and shows interesting differences in encoding linguistic variation associated with different pretraining strategies.

Linguistic Knowledge and Transferability of Contextual Representations

It is found that linear models trained on top of frozen contextual representations are competitive with state-of-the-art task-specific models in many cases, but fail on tasks requiring fine-grained linguistic knowledge.

Infusing Finetuning with Semantic Dependencies

This approach applies novel probes to recent language models and finds that, unlike syntax, semantics is not brought to the surface by today’s pretrained models, and uses convolutional graph encoders to explicitly incorporate semantic parses into task-specific finetuning, yielding benefits to natural language understanding tasks in the GLUE benchmark.

Contextual and Non-Contextual Word Embeddings: an in-depth Linguistic Investigation

It is shown that, although BERT is capable of understanding the full context of each word in an input sequence, the implicit knowledge encoded in its aggregated sentence representations is still comparable to that of a context-independent model.

Lost in Context? On the Sense-wise Variance of Contextualized Word Embeddings

This work quantifies how much the contextualized embeddings of each word sense vary across contexts in typical pre-trained models and proposes a simple way to alleviate position-biased word representations in distance-based word sense disambiguation settings.

Improving Contextual Representation with Gloss Regularized Pre-training

This work proposes an auxiliary gloss regularizer module to BERT pre-training (GR-BERT), to enhance word semantic similarity by predicting masked words and aligning contextual embeddings to corresponding glosses simultaneously, so that the word similarity can be explicitly modeled.

Negation, Coordination, and Quantifiers in Contextualized Language Models

This paper explores whether the semantic constraints of function words are learned and how the surrounding context impacts their embeddings, and creates suitable datasets and provides new insights into the inner workings of LMs vis-a-vis function words.

Does Chinese BERT Encode Word Structure?

This work investigates Chinese BERT using both attention weight distribution statistics and probing tasks, finding that word information is captured by BERT; word-level features are mostly in the middle representation layers; and downstream tasks make different use of word features in BERT.



Dissecting Contextual Word Embeddings: Architecture and Representation

There is a tradeoff between speed and accuracy, but all architectures learn high quality contextual representations that outperform word embeddings for four challenging NLP tasks, suggesting that unsupervised biLMs, independent of architecture, are learning much more about the structure of language than previously appreciated.

Deep Contextualized Word Representations

A new type of deep contextualized word representation is introduced that models both complex characteristics of word use and how these uses vary across linguistic contexts, allowing downstream models to mix different types of semi-supervision signals.

Language Modeling Teaches You More than Translation Does: Lessons Learned Through Auxiliary Syntactic Task Analysis

This work compares four objectives—language modeling, translation, skip-thought, and autoencoding—on their ability to induce syntactic and part-of-speech information, holding constant the quantity and genre of the training data, as well as the LSTM architecture.

Evaluating Compositionality in Sentence Embeddings

This work presents a new set of NLI sentence pairs that cannot be solved using only word-level knowledge and instead require some degree of compositionality, and finds that augmenting the training dataset with a new dataset improves performance on a held-out test set without loss of performance on the SNLI test set.

Fine-grained Analysis of Sentence Embeddings Using Auxiliary Prediction Tasks

This work proposes a framework that facilitates better understanding of the encoded representations of sentence vectors and demonstrates the potential contribution of the approach by analyzing different sentence representation mechanisms.

Deep RNNs Encode Soft Hierarchical Syntax

A set of experiments is presented to demonstrate that deep recurrent neural networks learn internal representations that capture soft hierarchical notions of syntax from highly varied supervision, indicating that a soft syntactic hierarchy emerges.

What you can cram into a single $&!#* vector: Probing sentence embeddings for linguistic properties

Ten probing tasks designed to capture simple linguistic features of sentences are introduced and used to study embeddings generated by three different encoders trained in eight distinct ways, uncovering intriguing properties of both encoders and training methods.

LSTMs Can Learn Syntax-Sensitive Dependencies Well, But Modeling Structure Makes Them Better

It is found that the mere presence of syntactic information does not improve accuracy, but when model architecture is determined by syntax, number agreement is improved: top-down construction outperforms left-corner and bottom-up variants in capturing non-local structural dependencies.

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

BERT, a new language representation model, is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, and can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.

Learned in Translation: Contextualized Word Vectors

Adding context vectors to a deep LSTM encoder from an attentional sequence-to-sequence model trained for machine translation to contextualize word vectors improves performance over using only unsupervised word and character vectors on a wide variety of common NLP tasks.