Corpus ID: 220265735

Multi-Head Attention: Collaborate Instead of Concatenate

  title={Multi-Head Attention: Collaborate Instead of Concatenate},
  author={Jean-Baptiste Cordonnier and Andreas Loukas and Martin Jaggi},
Attention layers are widely used in natural language processing (NLP) and are beginning to influence computer vision architectures. However, they suffer from over-parameterization. For instance, it was shown that the majority of attention heads could be pruned without impacting accuracy. This work aims to enhance current understanding on how multiple heads interact. Motivated by the observation that trained attention heads share common key/query projections, we propose a collaborative multi… Expand

Figures and Tables from this paper

Attention is Not All You Need: Pure Attention Loses Rank Doubly Exponentially with Depth
This work proposes a new way to understand self-attention networks: it is shown that their output can be decomposed into a sum of smaller terms—or paths—each involving the operation of a sequence of attention heads across layers, and proves that selfattention possesses a strong inductive bias towards “token uniformity”. Expand
Pay Attention when Required
This paper explored trade-offs and ordering of the blocks to improve upon the current Transformer architecture and proposed PAR Transformer, which needs 35% lower compute time than Transformer-XL achieved by replacing ~63% of the self-attention blocks with feed-forward blocks. Expand
ViTGAN: Training GANs with Vision Transformers
This paper integrates the ViT architecture into generative adversarial networks (GANs) and introduces novel regularization techniques for training GANs with ViTs, achieving comparable performance to state-of-the-art CNN-based StyleGAN2 on CIFAR-10, CelebA, and LSUN bedroom datasets. Expand
Tensor Methods in Computer Vision and Deep Learning
This article provides an in-depth and practical review of tensors and tensor methods in the context of representation learning and deep learning, with a particular focus on visual data analysis and computer vision applications. Expand
BitFit: Simple Parameter-efficient Fine-tuning for Transformer-based Masked Language-models
Findings support the hypothesis that finetuning is mainly about exposing knowledge induced by language-modeling training, rather than learning new task-specific linguistic knowledge. Expand
A Survey of Transformers
This survey provides a comprehensive review of various Transformer variants and proposes a new taxonomy of X-formers from three perspectives: architectural modification, pre-training, and applications. Expand


Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned
It is found that the most important and confident heads play consistent and often linguistically-interpretable roles and when pruning heads using a method based on stochastic gates and a differentiable relaxation of the L0 penalty, it is observed that specialized heads are last to be pruned. Expand
Are Sixteen Heads Really Better than One?
It is made the surprising observation that even if models have been trained using multiple heads, in practice, a large percentage of attention heads can be removed at test time without significantly impacting performance. Expand
Synthesizer: Rethinking Self-Attention in Transformer Models
The true importance and contribution of the dot product-based self-attention mechanism on the performance of Transformer models is investigated and a model that learns synthetic attention weights without token-token interactions is proposed, called Synthesizer. Expand
On the Relationship between Self-Attention and Convolutional Layers
This work proves that a multi-head self-attention layer with sufficient number of heads is at least as expressive as any convolutional layer, which provides evidence that attention layers can perform convolution and, indeed, they often learn to do so in practice. Expand
Attention is All you Need
A new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely is proposed, which generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data. Expand
Attention Augmented Convolutional Networks
It is found that Attention Augmentation leads to consistent improvements in image classification on ImageNet and object detection on COCO across many different models and scales, including ResNets and a state-of-the art mobile constrained network, while keeping the number of parameters similar. Expand
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks. Expand
DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter
This work proposes a method to pre-train a smaller general-purpose language representation model, called DistilBERT, which can be fine-tuned with good performances on a wide range of tasks like its larger counterparts, and introduces a triple loss combining language modeling, distillation and cosine-distance losses. Expand
Stand-Alone Self-Attention in Vision Models
The results establish that stand-alone self-attention is an important addition to the vision practitioner's toolbox and is especially impactful when used in later layers. Expand
ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
This work presents two parameter-reduction techniques to lower memory consumption and increase the training speed of BERT, and uses a self-supervised loss that focuses on modeling inter-sentence coherence. Expand