Corpus ID: 216036002

Attention Module is Not Only a Weight: Analyzing Transformers with Vector Norms

  Goro Kobayashi, Tatsuki Kuribayashi, Sho Yokoi, Kentaro Inui
Attention is a key component of Transformers, which have recently achieved considerable success in natural language processing. Hence, attention is being extensively studied to investigate the linguistic capabilities of Transformers, with a focus on the parallels between attention weights and specific linguistic phenomena. This paper shows that attention weights are only one of the two factors that determine the output of attention, and proposes a norm-based analysis that also incorporates the second factor: the norm of the transformed input vectors.
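The contrast the abstract draws can be sketched numerically: a token's contribution to an attention output depends not only on its attention weight α, but also on the norm of its transformed vector. A minimal NumPy illustration follows, using a random value projection `W_v` as a stand-in for the paper's transformation f and random inputs purely for demonstration:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8

# Hypothetical single-head self-attention over random inputs.
X = rng.normal(size=(seq_len, d_model))          # input vectors x_j
W_v = rng.normal(size=(d_model, d_model)) * 0.1  # value projection (stand-in for f)

scores = X @ X.T / np.sqrt(d_model)              # attention logits
alpha = np.exp(scores - scores.max(axis=-1, keepdims=True))
alpha /= alpha.sum(axis=-1, keepdims=True)       # attention weights alpha[i, j]

# Weight-based view: contribution of token j to token i is alpha[i, j] alone.
weight_contrib = alpha

# Norm-based view: contribution is ||alpha[i, j] * f(x_j)||, i.e. the weight
# scaled by the norm of the transformed input vector.
fx = X @ W_v                                     # f(x_j)
norm_contrib = alpha * np.linalg.norm(fx, axis=-1)

# A token with a large attention weight can still contribute little to the
# output if ||f(x_j)|| is small, so the two views can rank tokens differently.
print(np.argmax(weight_contrib, axis=-1))
print(np.argmax(norm_contrib, axis=-1))
```

Because the norm term rescales each column of the weight matrix, the most-attended token under the weight-based view need not be the largest contributor under the norm-based view.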
Visualizing Transformers for NLP: A Brief Survey
A survey on explaining Transformer architectures through visualizations, which examines the various Transformer facets that can be explored through visual analytics and proposes a set of requirements for future Transformer visualization frameworks.
Mask-Align: Self-Supervised Neural Word Alignment
This paper proposes Mask-Align, a self-supervised model designed specifically for the word alignment task; it masks and predicts each target token in parallel and extracts high-quality alignments without any supervised loss.
Does BERT Solve Commonsense Task via Commonsense Knowledge?
This work proposes two attention-based methods to analyze commonsense knowledge inside BERT, and finds that attention heads successfully capture the structured commonsense knowledge encoded in ConceptNet, which helps BERT solve commonsense tasks directly.
Multi-Stream Transformers
A Multi-Stream Transformer architecture is designed, and it is found that splitting the Transformer encoder into multiple encoder streams and allowing the model to merge multiple representational hypotheses improves performance, with further improvement obtained by adding a skip connection between the first and the final encoder layer.
On Commonsense Cues in BERT for Solving Commonsense Tasks
Using two different measures, it is found that BERT does use relevant knowledge for solving the task, and the presence of commonsense knowledge is positively correlated with model accuracy.
A Primer in BERTology: What We Know About How BERT Works
This paper is the first survey of over 150 studies of the popular BERT model, reviewing the current state of knowledge about how BERT works, what kind of information it learns and how it is represented, common modifications to its training objectives and architecture, the overparameterization issue, and approaches to compression.
Can Fine-tuning Pre-trained Models Lead to Perfect NLP? A Study of the Generalizability of Relation Extraction
From empirical experimentation, this study finds that BERT suffers a robustness bottleneck under randomization, adversarial and counterfactual tests, and biases (i.e., selection and semantic biases), and highlights opportunities for future improvements.
Memory Transformer
This work proposes and studies two extensions of the Transformer baseline: adding memory tokens to store non-local representations, and creating a memory bottleneck for global information. These memory-augmented Transformers are evaluated on a machine translation task, demonstrating that memory size correlates positively with model performance.
On Robustness and Bias Analysis of BERT-based Relation Extraction
  • Luoqiu Li, Xiang Chen, +4 authors Huajun Chen
  • Computer Science
  • 2020
Fine-tuning pre-trained models has achieved impressive performance on standard natural language processing benchmarks. However, the resultant models' generalizability remains poorly understood.
When BERT Plays the Lottery, All Tickets Are Winning
It is shown that the "bad" subnetworks can be fine-tuned separately to achieve only slightly worse performance than the "good" ones, indicating that most weights in the pre-trained BERT are potentially useful.


An Analysis of Encoder Representations in Transformer-Based Machine Translation
This work investigates the information that is learned by the attention mechanism in Transformer models with different translation quality, and sheds light on the relative strengths and weaknesses of the various encoder representations.
Is Attention Interpretable?
While attention noisily predicts input components' overall importance to a model, it is by no means a fail-safe indicator; in many cases, gradient-based rankings of attention weights predict their effects better than the weights' magnitudes do.
On Identifiability in Transformers
It is shown that self-attention distributions are not directly interpretable; the identifiability of attention weights and token embeddings is studied, along with the aggregation of context into hidden tokens.
From Balustrades to Pierre Vinken: Looking for Syntax in Transformer Self-Attentions
A transparent deterministic method of quantifying the amount of syntactic information present in the self-attentions is proposed, based on automatically building and evaluating phrase-structure trees from the phrase-like sequences.
Do Attention Heads in BERT Track Syntactic Dependencies?
The results suggest that these models have some specialist attention heads that track individual dependency types, but no generalist head that performs holistic parsing significantly better than a trivial baseline, and that analyzing attention weights directly may not reveal much of the syntactic knowledge that BERT-style models are known to learn.
An Analysis of Attention Mechanisms: The Case of Word Sense Disambiguation in Neural Machine Translation
It is concluded that attention is not the main mechanism used by NMT models to incorporate contextual information for WSD; the experimental results suggest that NMT models learn to encode the contextual information necessary for WSD in the encoder hidden states.
Adding Interpretable Attention to Neural Translation Models Improves Word Alignment
This work proposes a simple model extension to the Transformer architecture that makes use of its hidden representations and is restricted to attend solely to encoder information to predict the next word, and introduces a novel alignment inference procedure that applies stochastic gradient descent to directly optimize the attention activations towards a given target word.
What Does BERT Look at? An Analysis of BERT’s Attention
It is shown that certain attention heads correspond well to linguistic notions of syntax and coreference, and an attention-based probing classifier is proposed and used to demonstrate that substantial syntactic information is captured in BERT’s attention.
Are Sixteen Heads Really Better than One?
The surprising observation is made that even if models have been trained using multiple heads, in practice a large percentage of attention heads can be removed at test time without significantly impacting performance.
Attention is All you Need
A new simple network architecture, the Transformer, based solely on attention mechanisms and dispensing with recurrence and convolutions entirely, is proposed; it generalizes well to other tasks, as demonstrated by successful application to English constituency parsing with both large and limited training data.
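The attention weights analyzed throughout the papers above all come from this paper's scaled dot-product attention, softmax(QKᵀ/√d_k)V. A minimal NumPy sketch (single head, no masking; shapes and inputs are illustrative only):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.

    Returns the attended outputs and the attention weight matrix, the
    object most of the analysis papers above inspect.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # (queries, keys)
    scores -= scores.max(axis=-1, keepdims=True)      # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)    # rows sum to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))   # 3 query vectors of dimension d_k = 4
K = rng.normal(size=(5, 4))   # 5 key vectors
V = rng.normal(size=(5, 4))   # 5 value vectors
out, w = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (3, 4)
```

Each row of `w` is a probability distribution over the keys; the weight-based analyses above interpret these rows directly, while the norm-based analysis of the main paper rescales them by the norms of the transformed value vectors.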