Corpus ID: 231861615

On the Regularity of Attention

James Vuckovic, Aristide Baratin, Rémi Tachet des Combes
Attention is a powerful component of modern neural networks across a wide variety of domains. In this paper, we seek to quantify the regularity (i.e. the amount of smoothness) of the attention operation. To accomplish this goal, we propose a new mathematical framework that uses measure theory and integral operators to model attention. We show that this framework is consistent with the usual definition, and that it captures the essential properties of attention. Then we use this framework to…
Sinkformers: Transformers with Doubly Stochastic Attention
This paper proposes to use Sinkhorn’s algorithm to make attention matrices doubly stochastic, and shows that Sinkformers enhance model accuracy in vision and natural language processing tasks and lead to a significant improvement in 3D shape classification.


Infinite attention: NNGP and NTK for deep attention networks
A rigorous extension of NNGP and NTK results to NNs involving attention layers is provided, showing that, unlike single-head attention, which induces non-Gaussian behaviour, multi-head attention architectures behave as GPs as the number of heads tends to infinity.
Set Transformer: A Framework for Attention-based Permutation-Invariant Neural Networks
This work presents an attention-based neural network module, the Set Transformer, specifically designed to model interactions among elements in the input set, and reduces the computation time of self-attention from quadratic to linear in the number of elements in the set.
Attention is All you Need
A new simple network architecture, the Transformer, is proposed. It is based solely on attention mechanisms, dispensing with recurrence and convolutions entirely, and it generalizes well to other tasks, as shown by applying it successfully to English constituency parsing with both large and limited training data.
Invertible Residual Networks
The empirical evaluation shows that invertible ResNets perform competitively with both state-of-the-art image classifiers and flow-based generative models, something that has not been previously achieved with a single architecture.
Deep Equilibrium Models
It is shown that DEQs often improve performance over these state-of-the-art models (for similar parameter counts), have similar computational requirements to existing models, and vastly reduce memory consumption (often the bottleneck for training large sequence models), demonstrating up to an 88% memory reduction in the authors' experiments.
Limits to Depth Efficiencies of Self-Attention
By identifying network width as a limiting factor, the analysis indicates that solutions for dramatically increasing the width can facilitate the next leap in self-attention expressivity.
End-To-End Memory Networks
A neural network with a recurrent attention model over a possibly large external memory is introduced. It is trained end-to-end and hence requires significantly less supervision during training, making it more generally applicable in realistic settings.
Sequence to Sequence Learning with Neural Networks
This paper presents a general end-to-end approach to sequence learning that makes minimal assumptions on the sequence structure, and finds that reversing the order of the words in all source sentences improved the LSTM's performance markedly, because doing so introduced many short-term dependencies between the source and the target sentence, which made the optimization problem easier.
Understanding deep convolutional networks
  • S. Mallat
  • Medicine, Mathematics
  • Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences
  • 2016
Deep convolutional networks provide state-of-the-art classification and regression results over many high-dimensional problems, and a mathematical framework is introduced to analyse their properties.
Bidirectional Attention Flow for Machine Comprehension
The BIDAF network is introduced, a multi-stage hierarchical process that represents the context at different levels of granularity and uses a bi-directional attention flow mechanism to obtain a query-aware context representation without early summarization.