Guiding Attention for Self-Supervised Learning with Transformers

A. Deshpande, Karthik Narasimhan
In this paper, we propose a simple and effective technique to allow for efficient self-supervised learning with bi-directional Transformers. Our approach is motivated by recent studies demonstrating that self-attention patterns in trained models contain a majority of non-linguistic regularities. We propose a computationally efficient auxiliary loss function to guide attention heads to conform to such patterns. Our method is agnostic to the actual pre-training objective and results in faster… 
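The idea of an auxiliary loss that nudges attention heads toward simple, non-linguistic patterns can be illustrated with a small sketch. The abstract above does not give the form of the loss or the set of patterns, so everything here is an assumption made for illustration: the positional patterns ("first", "prev", "next"), the mean-squared-error penalty, the weighting factor `alpha`, and all function names are hypothetical, not the authors' implementation.

```python
import numpy as np

def fixed_pattern(name, seq_len):
    """Build a (seq_len, seq_len) target attention map; each row sums to 1.

    The pattern names are illustrative assumptions, not taken from the paper.
    """
    target = np.zeros((seq_len, seq_len))
    rows = np.arange(seq_len)
    if name == "first":
        # Every token attends to the first token.
        target[:, 0] = 1.0
    elif name == "prev":
        # Every token attends to its predecessor (token 0 attends to itself).
        target[rows, np.maximum(rows - 1, 0)] = 1.0
    elif name == "next":
        # Every token attends to its successor (last token attends to itself).
        target[rows, np.minimum(rows + 1, seq_len - 1)] = 1.0
    else:
        raise ValueError(f"unknown pattern: {name}")
    return target

def attention_guidance_loss(attn, head_targets):
    """Mean squared error between selected heads and their target patterns.

    attn: array of shape (num_heads, seq_len, seq_len) with attention weights.
    head_targets: dict mapping a head index to a target map of shape
        (seq_len, seq_len). Heads not listed are left unconstrained.
    """
    per_head = [np.mean((attn[h] - t) ** 2) for h, t in head_targets.items()]
    return float(np.mean(per_head))

def total_loss(pretraining_loss, attn, head_targets, alpha=1.0):
    """Add the guidance term to an arbitrary pre-training loss.

    The guidance term never looks at the pre-training objective itself,
    which is what makes this kind of auxiliary loss objective-agnostic.
    """
    return pretraining_loss + alpha * attention_guidance_loss(attn, head_targets)
```

In this sketch a head that already matches its target contributes zero loss, while a head with uniform attention is penalized; in a real training loop the same computation would be written in an autodiff framework so that the penalty produces gradients for the attention parameters.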


This work proposes the ConceptTransformer, a deep learning module that exposes explanations of the output of the model in which it is embedded in terms of attention over user-defined high-level concepts, and that can be used to infuse domain knowledge into classifiers to improve accuracy and, conversely, to extract concept-based explanations of classification outputs.

Paying More Attention to Self-attention: Improving Pre-trained Language Models via Attention Guiding

This work proposes a simple yet effective attention guiding mechanism that improves the performance of PLMs by steering attention towards established goals. It introduces two attention guiding methods: attention map discrimination guiding (MDG) and attention pattern decorrelation guiding (PDG).

AMR Alignment: Paying Attention to Cross-Attention

This paper investigates the ability of Transformer-based parsing models to yield effective alignments without ad-hoc strategies and presents the first in-depth exploration of cross-attention for AMR by proxy of alignment between the sentence spans and the semantic units in the graph.

A Survey of Transformers

This survey provides a comprehensive review of various Transformer variants and proposes a new taxonomy of X-formers from three perspectives: architectural modification, pre-training, and applications.

An Empirical Investigation of Word Alignment Supervision for Zero-Shot Multilingual Neural Machine Translation

This paper compares and evaluates several MNMT systems on three multilingual MT benchmarks of different sizes, showing that simply supervising one cross attention head to focus both on word alignments and language labels reduces the bias towards translating into the wrong language, improving the zero-shot performance overall.

Self-Guided Body Part Alignment With Relation Transformers for Occluded Person Re-Identification

This work proposes the Self-guided Body Part Alignment method that learns cue-free semantic-aligned local prediction for feature representations to avoid high-cost dependence on external cues for person re-identification in the wild.

Retroformer: Pushing the Limits of Interpretable End-to-end Retrosynthesis Transformer

This work proposes Retroformer, a novel Transformer-based architecture for retrosynthesis prediction that does not rely on any cheminformatics tools for molecule editing. It reaches new state-of-the-art accuracy for end-to-end template-free retrosynthesis and improves over many strong baselines in molecule and reaction validity.



Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned

It is found that the most important and confident heads play consistent and often linguistically-interpretable roles and when pruning heads using a method based on stochastic gates and a differentiable relaxation of the L0 penalty, it is observed that specialized heads are last to be pruned.

Fixed Encoder Self-Attention Patterns in Transformer-Based Machine Translation

This paper proposes to replace all but one attention head of each encoder layer with simple fixed – non-learnable – attentive patterns that are solely based on position and do not require any external knowledge.

On Identifiability in Transformers

It is shown that self-attention distributions are not directly interpretable; the identifiability of attention weights and token embeddings, and the aggregation of context into hidden tokens, are studied.

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

BERT is a new language representation model designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. It can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.

An Analysis of Encoder Representations in Transformer-Based Machine Translation

This work investigates the information that is learned by the attention mechanism in Transformer models with different translation quality, and sheds light on the relative strengths and weaknesses of the various encoder representations.

Training Tips for the Transformer Model

The experiments in neural machine translation using the recent Tensor2Tensor framework and the Transformer sequence-to-sequence model are described, confirming the general mantra “more data and larger models”.

Hard-Coded Gaussian Attention for Neural Machine Translation

A “hard-coded” attention variant without any learned parameters is developed. It offers insight into which components of the Transformer are actually important, and the authors hope it will guide future work towards simpler and more efficient attention-based models.

On Layer Normalization in the Transformer Architecture

It is proved with mean field theory that, at initialization, the originally designed Post-LN Transformer, which places the layer normalization between the residual blocks, has large expected gradients for the parameters near the output layer, so that using a large learning rate makes training unstable.

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

This systematic study compares pre-training objectives, architectures, unlabeled datasets, transfer approaches, and other factors on dozens of language understanding tasks and achieves state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more.

Longformer: The Long-Document Transformer

Following prior work on long-sequence transformers, Longformer is evaluated on character-level language modeling, where it achieves state-of-the-art results on text8 and enwik8; it is also pretrained and finetuned on a variety of downstream tasks.