Information-Theoretic Probing with Minimum Description Length

@inproceedings{Voita2020InformationTheoreticPW,
  title={Information-Theoretic Probing with Minimum Description Length},
  author={Elena Voita and Ivan Titov},
  booktitle={Conference on Empirical Methods in Natural Language Processing},
  year={2020}
}
To measure how well pretrained representations encode some linguistic property, it is common to use the accuracy of a probe, i.e. a classifier trained to predict the property from the representations. Despite the widespread adoption of probes, differences in their accuracy fail to adequately reflect differences in representations. For example, they do not substantially favour pretrained representations over randomly initialized ones. Analogously, their accuracy can be similar when probing for genuine… 
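The MDL view sketched in the abstract can be made concrete with the online (prequential) code: labels are transmitted block by block, each block encoded with a probe trained on all previously transmitted blocks, and the total codelength replaces accuracy as the evaluation measure. A minimal NumPy sketch, assuming a plain softmax-regression probe (function names and hyperparameters are illustrative, not the paper's implementation):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def train_probe(X, y, n_classes, lr=0.5, steps=300):
    """Plain softmax-regression probe trained by gradient descent."""
    W = np.zeros((X.shape[1], n_classes))
    Y = np.eye(n_classes)[y]
    for _ in range(steps):
        W -= lr * X.T @ (softmax(X @ W) - Y) / len(y)
    return W

def online_codelength_bits(X, y, n_classes, fractions=(0.1, 0.2, 0.4, 0.8, 1.0)):
    """Online (prequential) code: transmit labels block by block, each block
    encoded with a probe trained on all previously transmitted blocks.
    The first block is sent with a uniform code over the label set."""
    n = len(y)
    cuts = [max(1, int(f * n)) for f in fractions]
    bits = cuts[0] * np.log2(n_classes)
    for start, end in zip(cuts[:-1], cuts[1:]):
        W = train_probe(X[:start], y[:start], n_classes)
        P = softmax(X[start:end] @ W)
        # cost in bits of encoding the true labels under the probe's model
        bits += -np.log2(P[np.arange(end - start), y[start:end]] + 1e-12).sum()
    return bits
```

Representations that genuinely encode the property yield a short codelength, while uninformative ones stay near the n·log2(K) cost of the uniform code, which is what lets MDL separate pretrained from random representations where raw accuracy cannot.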

Probing as Quantifying Inductive Bias

Theoretical and empirical results suggest that the proposed framework alleviates many previous problems found in probing and is able to offer concrete evidence that—for some tasks—fastText can offer a better inductive bias than BERT.

Intrinsic Probing through Dimension Selection

This paper proposes a novel framework based on a decomposable multivariate Gaussian probe that allows us to determine whether the linguistic information in word embeddings is dispersed or focal, and probes fastText and BERT for various morphosyntactic attributes across 36 languages.

Classifier Probes May Just Learn from Linear Context Features

It is shown that the token embeddings learned by neural sentence encoders contain a significant amount of information about the exact linear context of the token, and it is hypothesized that, with such information, learning standard probing tasks may be feasible even without additional linguistic structure.

Conditional probing: measuring usable information beyond a baseline

This work extends a theory of usable information called V-information and proposes conditional probing, which explicitly conditions on the information in the baseline; after conditioning on non-contextual word embeddings, it finds that properties like part-of-speech are accessible at deeper layers of a network than previously thought.

Information-Theoretic Probing for Linguistic Structure

This work operationalizes probing information-theoretically as estimating mutual information, which contradicts received wisdom: one should always select the highest-performing probe one can, even if it is more complex, since it will result in a tighter estimate and thus reveal more of the linguistic information inherent in the representation.
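The argument rests on a simple bound: a probe with held-out cross-entropy CE certifies I(R; Y) ≥ H(Y) − CE, so a stronger (lower-CE) probe tightens the mutual-information estimate. A minimal sketch of that bound (function names are illustrative):

```python
import numpy as np

def entropy_bits(y, n_classes):
    """Empirical label entropy H(Y) in bits."""
    p = np.bincount(y, minlength=n_classes) / len(y)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def mi_lower_bound_bits(y, probe_probs):
    """I(R; Y) >= H(Y) - CE(probe): any probe's cross-entropy on held-out
    data gives a lower bound on the mutual information between the
    representation R and the label Y; lower CE means a tighter bound."""
    ce = float(-np.log2(probe_probs[np.arange(len(y)), y]).mean())
    return entropy_bits(y, probe_probs.shape[1]) - ce
```

A perfect probe recovers the full H(Y) as the bound, while a uniform probe certifies nothing, which is exactly why the paper argues for the most accurate probe available.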

Probing the Probing Paradigm: Does Probing Accuracy Entail Task Relevance?

This work examines this probing paradigm through a case study in Natural Language Inference, showing that models can learn to encode linguistic properties even if they are not needed for the task on which the model was trained, and identifies that pretrained word embeddings play a considerable role in encoding these properties.

A Latent-Variable Model for Intrinsic Probing

This work proposes a novel latent-variable formulation for constructing intrinsic probes and derives a tractable variational approximation to the log-likelihood and finds empirical evidence that pre-trained representations develop a cross-lingually entangled notion of morphosyntax.

Probing as Quantifying the Inductive Bias of Pre-trained Representations

This work presents a novel framework for probing where the goal is to evaluate the inductive bias of representations for a particular task, and provides a practical avenue to do this using Bayesian inference.

Comparing Text Representations: A Theory-Driven Approach

This method provides a calibrated, quantitative measure of the difficulty of a classification-based NLP task, enabling comparisons between representations without requiring empirical evaluations that may be sensitive to initializations and hyperparameters.

Test Harder than You Train: Probing with Extrapolation Splits

This paper analyses probes in an extrapolation setting, where the inputs at test time are deliberately chosen to be ‘harder’ than the training examples, and concludes that distance-based and hard statistical criteria show the clearest differences between interpolation and extrapolation settings, while at the same time being transparent, intuitive, and easy to control.
...

Designing and Interpreting Probes with Control Tasks

Control tasks, which associate word types with random outputs, are proposed to complement linguistic tasks, and it is found that dropout, commonly used to control probe complexity, is ineffective for improving selectivity of MLPs, but that other forms of regularization are effective.
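The control-task construction is easy to state: every word type is assigned a fixed random label, so a probe can only succeed on it by memorizing word identities, and selectivity is the linguistic-task accuracy minus the control-task accuracy. A minimal sketch (hypothetical helper names, not the paper's code):

```python
import numpy as np

def control_task_labels(word_ids, n_labels, seed=0):
    """Assign each word *type* a fixed random label; every token of the
    same type shares that label, so only memorization solves the task."""
    r = np.random.default_rng(seed)
    mapping = {}
    out = []
    for w in word_ids:
        if w not in mapping:
            mapping[w] = int(r.integers(n_labels))
        out.append(mapping[w])
    return np.array(out)

def selectivity(linguistic_acc, control_acc):
    """High selectivity = the probe reads the property off the
    representation rather than memorizing word identities."""
    return linguistic_acc - control_acc
```

A probe that scores well on both the linguistic task and the control task has low selectivity and is likely memorizing, which is the failure mode the paper's regularization comparison targets.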

What Does BERT Look at? An Analysis of BERT’s Attention

It is shown that certain attention heads correspond well to linguistic notions of syntax and coreference, and an attention-based probing classifier is proposed and used to demonstrate that substantial syntactic information is captured in BERT’s attention.

No Training Required: Exploring Random Encoders for Sentence Classification

The aim is to put sentence embeddings on more solid footing by looking at how much modern sentence embeddings gain over random methods and providing the field with more appropriate baselines going forward, which are, as it turns out, quite strong.

Assessing the Ability of LSTMs to Learn Syntax-Sensitive Dependencies

It is concluded that LSTMs can capture a non-trivial amount of grammatical structure given targeted supervision, but stronger architectures may be required to further reduce errors; furthermore, the language modeling signal is insufficient for capturing syntax-sensitive dependencies, and should be supplemented with more direct supervision if such dependencies need to be captured.

oLMpics-On What Language Model Pre-training Captures

This work proposes eight reasoning tasks, which conceptually require operations such as comparison, conjunction, and composition, and findings can help future work on designing new datasets, models, and objective functions for pre-training.

Language Modeling Teaches You More than Translation Does: Lessons Learned Through Auxiliary Task Analysis

It is found that representations from language models consistently perform best on syntactic auxiliary prediction tasks, even when trained on relatively small amounts of data, which suggests that language modeling may be the best data-rich pretraining task for transfer-learning applications requiring syntactic information.

Learning and Evaluating General Linguistic Intelligence

This work analyzes state-of-the-art natural language understanding models and conducts an extensive empirical investigation to evaluate them against general linguistic intelligence criteria, and proposes a new evaluation metric based on an online encoding of the test data that quantifies how quickly an existing agent (model) learns a new task.

Language Models are Unsupervised Multitask Learners

It is demonstrated that language models begin to learn these tasks without any explicit supervision when trained on a new dataset of millions of webpages called WebText, suggesting a promising path towards building language processing systems which learn to perform tasks from their naturally occurring demonstrations.

An Analysis of Encoder Representations in Transformer-Based Machine Translation

This work investigates the information that is learned by the attention mechanism in Transformer models with different translation quality, and sheds light on the relative strengths and weaknesses of the various encoder representations.