Information-Theoretic Probing with Minimum Description Length
@inproceedings{Voita2020InformationTheoreticPW,
  title={Information-Theoretic Probing with Minimum Description Length},
  author={Elena Voita and Ivan Titov},
  booktitle={Conference on Empirical Methods in Natural Language Processing},
  year={2020}
}

To measure how well pretrained representations encode some linguistic property, it is common to use the accuracy of a probe, i.e., a classifier trained to predict the property from the representations. Despite the widespread adoption of probes, differences in their accuracy fail to adequately reflect differences in representations. For example, they do not substantially favour pretrained representations over randomly initialized ones. Analogously, their accuracy can be similar when probing for genuine…
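The MDL idea behind the paper can be sketched as a prequential (online) code: transmit the first block of labels with a uniform code, then repeatedly train a probe on the data seen so far and pay the cross-entropy cost, in bits, of the labels in the next block. The sketch below is a minimal illustration using a plain softmax probe trained by full-batch gradient descent; the timestep fractions, learning rate, and probe architecture are illustrative assumptions, not the paper's exact setup.

```python
import numpy as np

def online_codelength(X, y, fractions=(0.1, 0.2, 0.4, 0.8, 1.0),
                      epochs=200, lr=0.5):
    """Prequential (online) codelength, in bits, of labels y given
    representations X, using a linear softmax probe retrained from
    scratch at every timestep."""
    n, d = X.shape
    k = int(y.max()) + 1
    t0 = max(1, int(fractions[0] * n))
    total_bits = t0 * np.log2(k)              # first block: uniform code
    for f_tr, f_nx in zip(fractions[:-1], fractions[1:]):
        tr, nx = max(1, int(f_tr * n)), int(f_nx * n)
        W, b = np.zeros((d, k)), np.zeros(k)
        for _ in range(epochs):               # full-batch gradient descent
            logits = X[:tr] @ W + b
            logits -= logits.max(axis=1, keepdims=True)
            p = np.exp(logits)
            p /= p.sum(axis=1, keepdims=True)
            p[np.arange(tr), y[:tr]] -= 1.0   # dL/dlogits
            W -= lr * X[:tr].T @ p / tr
            b -= lr * p.mean(axis=0)
        # cost of transmitting the next block under the trained probe
        logits = X[tr:nx] @ W + b
        logits -= logits.max(axis=1, keepdims=True)
        p = np.exp(logits)
        p /= p.sum(axis=1, keepdims=True)
        total_bits += -np.log2(p[np.arange(nx - tr), y[tr:nx]] + 1e-12).sum()
    return float(total_bits)
```

A representation that genuinely encodes the property yields a short codelength, while random labels cost close to the uniform baseline of log2(k) bits per label, which is the sense in which codelength, unlike raw probe accuracy, separates informative representations from uninformative ones.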
196 Citations
Probing as Quantifying Inductive Bias
- 2022
Computer Science
ACL
Theoretical and empirical results suggest that the proposed framework alleviates many previous problems found in probing and is able to offer concrete evidence that—for some tasks—fastText can offer a better inductive bias than BERT.
Intrinsic Probing through Dimension Selection
- 2020
Computer Science
EMNLP
This paper proposes a novel framework based on a decomposable multivariate Gaussian probe that allows us to determine whether the linguistic information in word embeddings is dispersed or focal, and probes fastText and BERT for various morphosyntactic attributes across 36 languages.
Classifier Probes May Just Learn from Linear Context Features
- 2020
Computer Science
COLING
It is shown that the token embeddings learned by neural sentence encoders contain a significant amount of information about the exact linear context of the token, and it is hypothesized that, with such information, learning standard probing tasks may be feasible even without additional linguistic structure.
Conditional probing: measuring usable information beyond a baseline
- 2021
Computer Science, Psychology
EMNLP
This work extends a theory of usable information called V-information and proposes conditional probing, which explicitly conditions on the information in a baseline; after conditioning on non-contextual word embeddings, properties like part-of-speech are found to be accessible at deeper layers of a network than previously thought.
Information-Theoretic Probing for Linguistic Structure
- 2020
Computer Science
ACL
An information-theoretic operationalization of probing as estimating mutual information that contradicts received wisdom: one should always select the highest performing probe one can, even if it is more complex, since it will result in a tighter estimate, and thus reveal more of the linguistic information inherent in the representation.
Probing the Probing Paradigm: Does Probing Accuracy Entail Task Relevance?
- 2021
Computer Science
EACL
This work examines this probing paradigm through a case study in Natural Language Inference, showing that models can learn to encode linguistic properties even if they are not needed for the task on which the model was trained, and identifies that pretrained word embeddings play a considerable role in encoding these properties.
A Latent-Variable Model for Intrinsic Probing
- 2022
Computer Science
ArXiv
This work proposes a novel latent-variable formulation for constructing intrinsic probes and derives a tractable variational approximation to the log-likelihood and finds empirical evidence that pre-trained representations develop a cross-lingually entangled notion of morphosyntax.
Probing as Quantifying the Inductive Bias of Pre-trained Representations
- 2021
Computer Science
ArXiv
This work presents a novel framework for probing where the goal is to evaluate the inductive bias of representations for a particular task, and provides a practical avenue to do this using Bayesian inference.
Comparing Text Representations: A Theory-Driven Approach
- 2021
Computer Science, Biology
EMNLP
This method provides a calibrated, quantitative measure of the difficulty of a classification-based NLP task, enabling comparisons between representations without requiring empirical evaluations that may be sensitive to initializations and hyperparameters.
Test Harder than You Train: Probing with Extrapolation Splits
- 2021
Computer Science
BLACKBOXNLP
This paper analyses probes in an extrapolation setting, where the inputs at test time are deliberately chosen to be ‘harder’ than the training examples, and concludes that distance-based and hard statistical criteria show the clearest differences between interpolation and extrapolation settings, while at the same time being transparent, intuitive, and easy to control.
45 References
Information-Theoretic Probing for Linguistic Structure
- 2020
Computer Science
ACL
An information-theoretic operationalization of probing as estimating mutual information that contradicts received wisdom: one should always select the highest performing probe one can, even if it is more complex, since it will result in a tighter estimate, and thus reveal more of the linguistic information inherent in the representation.
Designing and Interpreting Probes with Control Tasks
- 2019
Linguistics, Computer Science
EMNLP
Control tasks, which associate word types with random outputs, are proposed to complement linguistic tasks, and it is found that dropout, commonly used to control probe complexity, is ineffective for improving selectivity of MLPs, but that other forms of regularization are effective.
What Does BERT Look at? An Analysis of BERT’s Attention
- 2019
Computer Science
BlackboxNLP@ACL
It is shown that certain attention heads correspond well to linguistic notions of syntax and coreference, and an attention-based probing classifier is proposed and used to demonstrate that substantial syntactic information is captured in BERT’s attention.
No Training Required: Exploring Random Encoders for Sentence Classification
- 2019
Computer Science
ICLR
The aim is to put sentence embeddings on more solid footing by looking at how much modern sentence embeddings gain over random methods, and by providing the field with more appropriate baselines going forward, which turn out to be quite strong.
Assessing the Ability of LSTMs to Learn Syntax-Sensitive Dependencies
- 2016
Computer Science
TACL
It is concluded that LSTMs can capture a non-trivial amount of grammatical structure given targeted supervision, but stronger architectures may be required to further reduce errors; furthermore, the language modeling signal is insufficient for capturing syntax-sensitive dependencies, and should be supplemented with more direct supervision if such dependencies need to be captured.
oLMpics-On What Language Model Pre-training Captures
- 2020
Computer Science
Transactions of the Association for Computational Linguistics
This work proposes eight reasoning tasks, which conceptually require operations such as comparison, conjunction, and composition, and findings can help future work on designing new datasets, models, and objective functions for pre-training.
Language Modeling Teaches You More than Translation Does: Lessons Learned Through Auxiliary Task Analysis
- 2018
Computer Science, Linguistics
It is found that representations from language models consistently perform best on syntactic auxiliary prediction tasks, even when trained on relatively small amounts of data, which suggests that language modeling may be the best data-rich pretraining task for transfer learning applications requiring syntactic information.
Learning and Evaluating General Linguistic Intelligence
- 2019
Computer Science
ArXiv
This work analyzes state-of-the-art natural language understanding models and conducts an extensive empirical investigation to evaluate them against general linguistic intelligence criteria, and proposes a new evaluation metric based on an online encoding of the test data that quantifies how quickly an existing agent (model) learns a new task.
Language Models are Unsupervised Multitask Learners
- 2019
Computer Science
It is demonstrated that language models begin to learn these tasks without any explicit supervision when trained on a new dataset of millions of webpages called WebText, suggesting a promising path towards building language processing systems which learn to perform tasks from their naturally occurring demonstrations.
An Analysis of Encoder Representations in Transformer-Based Machine Translation
- 2018
Computer Science
BlackboxNLP@EMNLP
This work investigates the information that is learned by the attention mechanism in Transformer models with different translation quality, and sheds light on the relative strengths and weaknesses of the various encoder representations.