How Does BERT Answer Questions?: A Layer-Wise Analysis of Transformer Representations

@article{vanAken2019HowDB,
  title={How Does BERT Answer Questions?: A Layer-Wise Analysis of Transformer Representations},
  author={Betty van Aken and Benjamin Winter and Alexander L{\"o}ser and Felix A. Gers},
  journal={Proceedings of the 28th ACM International Conference on Information and Knowledge Management},
  year={2019}
}
  • Betty van Aken, Benjamin Winter, Alexander Löser, Felix A. Gers
  • Published 11 September 2019
  • Computer Science
  • Proceedings of the 28th ACM International Conference on Information and Knowledge Management
Bidirectional Encoder Representations from Transformers (BERT) reach state-of-the-art results in a variety of Natural Language Processing tasks. However, understanding of their internal functioning is still insufficient and unsatisfactory. In order to better understand BERT and other Transformer-based models, we present a layer-wise analysis of BERT's hidden states. Unlike previous research, which mainly focuses on explaining Transformer models by their attention weights, we argue that hidden… 
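
The layer-wise analysis described above inspects BERT's hidden states at every layer rather than its attention weights. As a rough illustration of how such per-layer states can be extracted, here is a minimal sketch assuming the HuggingFace transformers library and the bert-base-uncased checkpoint (both are assumptions; the paper page does not prescribe an implementation):

import torch
from transformers import BertModel, BertTokenizer

# Load a BERT checkpoint and request hidden states from every layer.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

# A QA-style input: question and context encoded as one sequence pair.
inputs = tokenizer("Who wrote Hamlet?",
                   "Hamlet is a tragedy written by William Shakespeare.",
                   return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# outputs.hidden_states is a tuple: the embedding output plus one tensor
# per layer, each of shape (batch, sequence_length, hidden_size).
for layer, states in enumerate(outputs.hidden_states):
    print(f"layer {layer}: {tuple(states.shape)}")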

Citations

VisBERT: Hidden-State Visualizations for Transformers
TLDR
VisBERT, a tool for visualizing the contextual token representations within BERT for the task of (multi-hop) Question Answering, is presented; it allows users to identify distinct phases in BERT's transformations that resemble a traditional NLP pipeline and offers insights during failed predictions. (A minimal sketch of this kind of per-layer projection follows this list.)
The Devil is in the Details: Evaluating Limitations of Transformer-based Methods for Granular Tasks
TLDR
It is empirically demonstrated, across two datasets from different domains, that despite the expected high performance on abstract document matching, contextual embeddings are consistently (and at times vastly) outperformed by simple baselines like TF-IDF on more granular tasks.
Can Edge Probing Tasks Reveal Linguistic Knowledge in QA Models?
TLDR
A critical analysis of the EP task datasets reveals that EP models may rely on spurious correlations to make predictions, and indicates that even if fine-tuning changes the encoding of such knowledge, the EP test results might fail to reflect the change.
What Happens To BERT Embeddings During Fine-tuning?
TLDR
It is found that fine-tuning is a conservative process that primarily affects the top layers of BERT, albeit with noteworthy variation across tasks: SQuAD and MNLI, for instance, involve much shallower processing than other tasks.
BERTnesia: Investigating the capture and forgetting of knowledge in BERT
TLDR
This paper utilizes knowledge base completion tasks to probe every layer of pre-trained as well as fine-tuned BERT (ranking, question answering, NER) and finds that ranking models forget the least and retain more knowledge in their final layer.
Inserting Information Bottlenecks for Attribution in Transformers
TLDR
This paper applies information bottlenecks to analyze the attribution of each feature for prediction on a black-box model and shows the effectiveness of the method in terms of attribution and the ability to provide insight into how information flows through layers.
Unsupervised Evaluation for Question Answering with Transformers
TLDR
A consistent pattern in the answer representations is observed and shown to be usable for automatically evaluating whether a predicted answer span is correct; this has broad applications, e.g., in the semi-automatic development of QA datasets.
Attention Flows: Analyzing and Comparing Attention Mechanisms in Language Models
TLDR
The visualization, Attention Flows, is designed to support users in querying, tracing, and comparing attention within layers, across layers, and amongst attention heads in Transformer-based language models, and to help users gain insight on how a classification decision is made.
Modifying Memories in Transformer Models
TLDR
This paper proposes a new task of explicitly modifying specific factual knowledge in Transformer models while ensuring the model performance does not degrade on the unmodified facts, and benchmarks several approaches that provide natural baseline performances on this task.
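
Several of the citing papers above, VisBERT in particular, visualize these hidden states by projecting each layer's token vectors into two dimensions. A minimal sketch of such a projection, assuming the outputs.hidden_states tuple from the earlier snippet and scikit-learn (both assumptions, not artifacts of the cited work):

import numpy as np
from sklearn.decomposition import PCA

def project_layer(hidden_states, layer):
    # Project one layer's token vectors (seq_len, hidden_size) onto 2-D.
    states = hidden_states[layer][0].numpy()  # drop the batch dimension
    return PCA(n_components=2).fit_transform(states)

# Compare token geometry in an early versus a late layer, e.g.:
# early = project_layer(outputs.hidden_states, 1)
# late = project_layer(outputs.hidden_states, 12)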

References

SHOWING 1-10 OF 43 REFERENCES
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
TLDR
A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
Linguistic Knowledge and Transferability of Contextual Representations
TLDR
It is found that linear models trained on top of frozen contextual representations are competitive with state-of-the-art task-specific models in many cases, but fail on tasks requiring fine-grained linguistic knowledge. (A toy version of such a linear probe is sketched after this reference list.)
Attention is All you Need
TLDR
A new, simple network architecture, the Transformer, based solely on attention mechanisms and dispensing with recurrence and convolutions entirely, is proposed; it generalizes well to other tasks, as shown by its successful application to English constituency parsing with both large and limited training data.
Towards AI-Complete Question Answering: A Set of Prerequisite Toy Tasks
TLDR
This work argues for the usefulness of a set of proxy tasks that evaluate reading comprehension via question answering, and classify these tasks into skill sets so that researchers can identify (and then rectify) the failings of their systems.
Bidirectional Attention Flow for Machine Comprehension
TLDR
The BiDAF network is introduced: a multi-stage hierarchical process that represents the context at different levels of granularity and uses a bi-directional attention flow mechanism to obtain a query-aware context representation without early summarization.
Language Models are Unsupervised Multitask Learners
TLDR
It is demonstrated that language models begin to learn these tasks without any explicit supervision when trained on a new dataset of millions of webpages called WebText, suggesting a promising path towards building language processing systems which learn to perform tasks from their naturally occurring demonstrations.
Transformer-XL: Attentive Language Models beyond a Fixed-Length Context
TLDR
This work proposes a novel neural architecture Transformer-XL that enables learning dependency beyond a fixed length without disrupting temporal coherence, which consists of a segment-level recurrence mechanism and a novel positional encoding scheme.
What do you learn from context? Probing for sentence structure in contextualized word representations
TLDR
A novel edge probing task design is introduced and a broad suite of sub-sentence tasks derived from the traditional structured NLP pipeline are constructed to investigate how sentence structure is encoded across a range of syntactic, semantic, local, and long-range phenomena.
Universal Transformers
TLDR
The Universal Transformer (UT), a parallel-in-time self-attentive recurrent sequence model which can be cast as a generalization of the Transformer model and which addresses issues of parallelizability and global receptive field, is proposed.
What do Neural Machine Translation Models Learn about Morphology?
TLDR
This work analyzes the representations learned by neural MT models at various levels of granularity and empirically evaluates the quality of the representations for learning morphology through extrinsic part-of-speech and morphological tagging tasks.
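
The probing methodology referenced above (e.g., in "Linguistic Knowledge and Transferability of Contextual Representations") trains linear models on top of frozen representations. A toy sketch of such a linear probe, with random placeholder features and labels standing in for real frozen BERT vectors and linguistic annotations (scikit-learn and all data here are assumptions):

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Placeholders: frozen per-token features and coarse POS-style labels.
features = rng.normal(size=(200, 768))  # (num_tokens, hidden_size)
labels = rng.integers(0, 5, size=200)   # e.g., 5 part-of-speech classes

# The probe itself is just a linear classifier on the frozen features.
probe = LogisticRegression(max_iter=1000).fit(features, labels)
print("probe training accuracy:", probe.score(features, labels))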