Corpus ID: 220265735

Multi-Head Attention: Collaborate Instead of Concatenate

Jean-Baptiste Cordonnier, Andreas Loukas, Martin Jaggi
Attention layers are widely used in natural language processing (NLP) and are beginning to influence computer vision architectures. However, they suffer from over-parameterization: for instance, it has been shown that the majority of attention heads can be pruned without impacting accuracy. This work aims to improve our understanding of how multiple heads interact. Motivated by the observation that trained attention heads share common key/query projections, we propose a collaborative multi-head attention layer.
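
The collaborative scoring idea in the abstract can be sketched as follows: all heads share a single key/query projection pair, and each head re-weights the shared dimensions with its own mixing vector. Names, shapes, and the exact parameterization below are illustrative assumptions, not the paper's precise formulation.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d_model, d_shared, n_heads = 5, 16, 8, 4

X = rng.normal(size=(T, d_model))
W_q = rng.normal(size=(d_model, d_shared))   # shared query projection
W_k = rng.normal(size=(d_model, d_shared))   # shared key projection
mix = rng.normal(size=(n_heads, d_shared))   # one mixing vector per head

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

Q, K = X @ W_q, X @ W_k
# Head h scores: Q diag(mix[h]) K^T -- heads differ only through mix,
# instead of each head owning separate query/key projection matrices.
scores = np.einsum("tq,hq,sq->hts", Q, mix, K) / np.sqrt(d_shared)
attn = softmax(scores, axis=-1)              # (n_heads, T, T)
```

Because the projections are shared, the parameter count no longer grows with the number of heads times the full projection size; each extra head only adds a length-`d_shared` mixing vector.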

Papers citing this work

Local Multi-Head Channel Self-Attention for Facial Expression Recognition

LHC is a novel self-attention module that can be easily integrated into virtually every convolutional neural network, and that is specifically designed for computer vision, with a specific focus on facial expression recognition.

Attention is Not All You Need: Pure Attention Loses Rank Doubly Exponentially with Depth

This work proposes a new way to understand self-attention networks: it is shown that their output can be decomposed into a sum of smaller terms—or paths—each involving the operation of a sequence of attention heads across layers, and proves that self-attention possesses a strong inductive bias towards “token uniformity”.

Multi-Image Visual Question Answering

An empirical study of different feature extraction methods and loss functions for Visual Question Answering with multiple image inputs and a single ground truth, with benchmark results reported for each.

Pay Attention when Required

This paper explores trade-offs in the ordering of blocks to improve upon the current Transformer architecture and proposes the PAR Transformer, which requires 35% less compute time than Transformer-XL, achieved by replacing ~63% of the self-attention blocks with feed-forward blocks.

ViTGAN: Training GANs with Vision Transformers

This paper integrates the ViT architecture into generative adversarial networks (GANs) and introduces novel regularization techniques for training GANs with ViTs, achieving comparable performance to state-of-the-art CNN-based StyleGAN2 on CIFAR-10, CelebA, and LSUN bedroom datasets.

MultiHead MultiModal Deep Interest Recommendation Network

Experiments show that the multi-head multi-modal DIN improves the recommendation prediction effect, and outperforms current state-of-the-art methods on various comprehensive indicators.

Tensor Methods in Computer Vision and Deep Learning

This article provides an in-depth and practical review of tensors and tensor methods in the context of representation learning and deep learning, with a particular focus on visual data analysis and computer vision applications.

BitFit: Simple Parameter-efficient Fine-tuning for Transformer-based Masked Language-models

We introduce BitFit, a sparse-finetuning method where only the bias terms of the model (or a subset of them) are modified. We show that with small-to-medium training data, applying BitFit on pre-trained BERT models is competitive with (and sometimes better than) fine-tuning the entire model.
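
The bias-only fine-tuning idea is easy to sketch: during the optimizer step, gradients are applied only to parameters whose names mark them as biases, and all weight matrices stay frozen. The tiny two-layer parameter dictionary below is an illustrative stand-in, not the paper's BERT setup.

```python
import numpy as np

rng = np.random.default_rng(0)
params = {
    "W1": rng.normal(size=(4, 8)), "b1": np.zeros(8),
    "W2": rng.normal(size=(8, 2)), "b2": np.zeros(2),
}
trainable = {name for name in params if name.startswith("b")}  # biases only

def sgd_step(grads, lr=0.1):
    for name, g in grads.items():
        if name in trainable:            # BitFit: skip frozen weights
            params[name] -= lr * g

before = {k: v.copy() for k, v in params.items()}
fake_grads = {k: np.ones_like(v) for k, v in params.items()}
sgd_step(fake_grads)
```

After the step, `W1` and `W2` are unchanged while `b1` and `b2` have moved, so only a tiny fraction of the parameters needs to be stored per fine-tuned task.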

Constructing Global Coherence Representations: Identifying Interpretability and Coherences of Transformer Attention in Time Series Data

This paper presents a method for constructing global coherence representations from Multi-Headed Attention of Transformer architectures, and presents abstraction and interpretation methods, leading to intuitive visualizations of the respective attention patterns.

A Survey of Transformers

This survey provides a comprehensive review of various Transformer variants and proposes a new taxonomy of X-formers from three perspectives: architectural modification, pre-training, and applications.



References

Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned

It is found that the most important and confident heads play consistent and often linguistically interpretable roles, and that when pruning heads using a method based on stochastic gates and a differentiable relaxation of the L0 penalty, specialized heads are the last to be pruned.

Are Sixteen Heads Really Better than One?

This work makes the surprising observation that even if models have been trained using multiple heads, in practice a large percentage of attention heads can be removed at test time without significantly impacting performance.
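
Test-time head removal, as used in such pruning studies, amounts to multiplying each head's output by a 0/1 gate before the output projection, so a head can be switched off without retraining. Shapes and names below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n_heads, T, d_head = 4, 3, 8
head_outputs = rng.normal(size=(n_heads, T, d_head))
W_o = rng.normal(size=(n_heads * d_head, 16))

def project(outputs, gates):
    gated = outputs * gates[:, None, None]        # zero out pruned heads
    concat = gated.transpose(1, 0, 2).reshape(T, -1)
    return concat @ W_o                           # output projection

full = project(head_outputs, np.ones(n_heads))
pruned = project(head_outputs, np.array([1.0, 0.0, 1.0, 0.0]))
```

Comparing the task metric computed from `full` against `pruned` for different gate patterns is how such studies measure how much each head actually contributes.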

Synthesizer: Rethinking Self-Attention in Transformer Models

The true importance and contribution of the dot-product-based self-attention mechanism to the performance of Transformer models is investigated, and a model that learns synthetic attention weights without token-token interactions, called Synthesizer, is proposed.
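
One way to learn attention weights without token-token interactions is to predict each token's attention logits from that token alone, via a small projection to sequence length, with no query-key dot products anywhere. This is a toy sketch in the spirit of the dense variant; the parameterization is an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
T, d_model = 5, 16
X = rng.normal(size=(T, d_model))
W1 = rng.normal(size=(d_model, 32))
W2 = rng.normal(size=(32, T))        # maps each token to T attention logits
W_v = rng.normal(size=(d_model, d_model))

def softmax(z):
    z = z - z.max(-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(-1, keepdims=True)

logits = np.maximum(X @ W1, 0) @ W2  # (T, T), no token-token dot products
attn = softmax(logits)
out = attn @ (X @ W_v)               # attend over values as usual
```

Row `t` of `logits` depends only on token `t`'s own representation, which is exactly what distinguishes this scheme from dot-product attention.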

On the Relationship between Self-Attention and Convolutional Layers

This work proves that a multi-head self-attention layer with a sufficient number of heads is at least as expressive as any convolutional layer, which provides evidence that attention layers can perform convolution and, indeed, they often learn to do so in practice.

Attention is All you Need

A new simple network architecture, the Transformer, based solely on attention mechanisms and dispensing with recurrence and convolutions entirely, is proposed; it generalizes well to other tasks, being applied successfully to English constituency parsing with both large and limited training data.
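
The core operation introduced by this paper is scaled dot-product attention, Attention(Q, K, V) = softmax(Q Kᵀ / √d_k) V, which a few lines of numpy reproduce directly:

```python
import numpy as np

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)          # scaled dot products
    scores -= scores.max(-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(-1, keepdims=True)  # row-wise softmax
    return weights @ V

rng = np.random.default_rng(3)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
out = attention(Q, K, V)
```

Multi-head attention in the paper runs this operation once per head on separately projected Q, K, V and concatenates the results, which is precisely the concatenation that the collaborative scheme above replaces with shared projections.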

Attention Augmented Convolutional Networks

It is found that Attention Augmentation leads to consistent improvements in image classification on ImageNet and object detection on COCO across many different models and scales, including ResNets and a state-of-the-art mobile constrained network, while keeping the number of parameters similar.

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.

DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter

This work proposes a method to pre-train a smaller general-purpose language representation model, called DistilBERT, which can be fine-tuned with good performances on a wide range of tasks like its larger counterparts, and introduces a triple loss combining language modeling, distillation and cosine-distance losses.

Stand-Alone Self-Attention in Vision Models

The results establish that stand-alone self-attention is an important addition to the vision practitioner's toolbox and is especially impactful when used in later layers.

ALBERT: A Lite BERT for Self-supervised Learning of Language Representations

This work presents two parameter-reduction techniques to lower memory consumption and increase the training speed of BERT, and uses a self-supervised loss that focuses on modeling inter-sentence coherence.