Corpus ID: 238226830

Redesigning the Transformer Architecture with Insights from Multi-particle Dynamical Systems

Subhabrata Dutta, Tanya Gautam, Soumen Chakrabarti, Tanmoy Chakraborty
The Transformer and its variants have proven to be efficient sequence learners in many different domains. Despite their staggering success, a critical issue has been the enormous number of parameters that must be trained (ranging from 10^6 to 10^10), along with the quadratic complexity of dot-product attention. In this work, we investigate the problem of approximating the two central components of the Transformer, multi-head self-attention and point-wise feed-forward transformation, with…


Common Sense Knowledge Learning for Open Vocabulary Neural Reasoning: A First View into Chronic Disease Literature
This paper addresses reasoning tasks over open-vocabulary Knowledge Bases using state-of-the-art Neural Language Models (NLMs), with applications to scientific literature; the results identify NLMs that perform consistently and significantly in knowledge inference on both source and target tasks.


Axial Attention in Multidimensional Transformers
Axial Transformers are proposed: self-attention-based autoregressive models for images and other data organized as high-dimensional tensors, which retain full expressiveness over joint distributions and are easy to implement with standard deep learning frameworks, while requiring reasonable memory and computation.
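
The axial idea summarized above can be sketched minimally (an illustrative toy, not the paper's implementation: for an H×W grid, full attention over all H·W positions is replaced by attention along one axis at a time, cutting the score-matrix cost from O((HW)^2) to O(HW·(H+W)); the single-head `attend` helper is an assumption of this sketch):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attend(x):
    # Plain single-head self-attention along the second-to-last axis; x: (..., L, d).
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(x.shape[-1])
    return softmax(scores) @ x

H, W, d = 4, 6, 8
rng = np.random.default_rng(0)
x = rng.standard_normal((H, W, d))

# Attend within each row (along W), then within each column (along H).
row_attended = attend(x)
col_attended = np.swapaxes(attend(np.swapaxes(row_attended, 0, 1)), 0, 1)
print(col_attended.shape)  # (4, 6, 8)
```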
Understanding and Improving Transformer From a Multi-Particle Dynamic System Point of View
It is shown that the Transformer can be mathematically interpreted as a numerical Ordinary Differential Equation (ODE) solver for a convection-diffusion equation in a multi-particle dynamic system.
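
The ODE interpretation summarized above can be illustrated with a minimal sketch (assumptions: `f` below is an arbitrary stand-in for a Transformer sublayer, not the paper's convection-diffusion model; the point is only that a residual update x + f(x) coincides with one forward-Euler step of dx/dt = f(x) with step size 1):

```python
import numpy as np

def f(x):
    # Stand-in "velocity field" (playing the role of an attention/FFN sublayer).
    return np.tanh(x) - 0.5 * x

def residual_step(x):
    # A residual block: x + f(x).
    return x + f(x)

def euler_step(x, h):
    # One forward-Euler step of dx/dt = f(x) with step size h.
    return x + h * f(x)

x = np.array([0.3, -1.2, 0.8])
# With h = 1, the Euler step is exactly the residual update.
print(np.allclose(residual_step(x), euler_step(x, 1.0)))  # True
```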
Attention is All you Need
A new simple network architecture, the Transformer, based solely on attention mechanisms and dispensing with recurrence and convolutions entirely, is proposed; it generalizes well to other tasks, as shown by its successful application to English constituency parsing with both large and limited training data.
A Tensorized Transformer for Language Modeling
A novel self-attention model (namely Multi-linear attention), built on Block-Term Tensor Decomposition (BTD) combined with tensor-train decomposition, is proposed; it can not only largely compress the model parameters but also obtain performance improvements.
Linformer: Self-Attention with Linear Complexity
This paper demonstrates that the self-attention mechanism of the Transformer can be approximated by a low-rank matrix, and proposes a new self-attention mechanism that reduces the overall self-attention complexity from $O(n^2)$ to $O(n)$ in both time and space.
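
The low-rank trick can be sketched as follows (a minimal illustration under stated assumptions, not Linformer's actual code: projection matrices `E` and `F` of shape k×n, learned in the real model but random here, compress keys and values to length k, so the score matrix is n×k instead of n×n and the cost is O(n·k) for fixed k):

```python
import numpy as np

n, d, k = 512, 64, 32            # sequence length, head dim, projected length (k << n)
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
E, F = (rng.standard_normal((k, n)) for _ in range(2))  # learned in practice

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Standard attention: the n x n score matrix costs O(n^2).
full = softmax(Q @ K.T / np.sqrt(d)) @ V

# Linformer-style: project K and V down to length k first,
# giving an n x k score matrix and O(n * k) cost.
low_rank = softmax(Q @ (E @ K).T / np.sqrt(d)) @ (F @ V)

print(full.shape, low_rank.shape)  # (512, 64) (512, 64)
```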
Multi-level Residual Networks from Dynamical Systems View
This paper adopts the dynamical-systems point of view, analyzes the lesioning properties of ResNet both theoretically and experimentally, and proposes a novel method for accelerating ResNet training.
MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers
This work presents a simple and effective approach to compress large Transformer (Vaswani et al., 2017) based pre-trained models, termed deep self-attention distillation, and demonstrates that the monolingual model outperforms state-of-the-art baselines across different student model sizes.
Synthesizer: Rethinking Self-Attention in Transformer Models
The true importance and contribution of the dot-product self-attention mechanism to the performance of Transformer models is investigated, and Synthesizer, a model that learns synthetic attention weights without token-token interactions, is proposed.
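
"Synthetic attention without token-token interactions" can be sketched roughly as follows (a toy in the spirit of the dense variant, not the paper's code; all weight names here are assumptions of this sketch: each row of the attention matrix is produced by an MLP from that token alone, with no query-key dot products between tokens):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

n, d = 8, 16                      # sequence length, model dim
rng = np.random.default_rng(0)
X = rng.standard_normal((n, d))
W1 = rng.standard_normal((d, d))  # learned in the real model, random here
W2 = rng.standard_normal((d, n))
Wv = rng.standard_normal((d, d))

# Row i of `scores` depends only on token i's own representation:
# a two-layer MLP maps each token to its n attention logits directly.
scores = np.maximum(X @ W1, 0) @ W2      # (n, n), no QK^T anywhere
out = softmax(scores) @ (X @ Wv)
print(out.shape)  # (8, 16)
```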
Reformer: The Efficient Transformer
This work replaces dot-product attention with one that uses locality-sensitive hashing, and uses reversible residual layers instead of standard residuals, which allows activations to be stored only once during training instead of several times, making the model much more memory-efficient and much faster on long sequences.
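
The hashing step can be sketched in isolation (a toy illustration of angular LSH bucketing only, not Reformer's full attention pipeline: random projections assign nearby query/key vectors to the same bucket, and attention is then restricted to within-bucket pairs):

```python
import numpy as np

def lsh_buckets(x, n_buckets, rng):
    # Angular LSH via random projections: project each vector onto
    # n_buckets/2 random directions and take the argmax over [proj, -proj].
    d = x.shape[-1]
    R = rng.standard_normal((d, n_buckets // 2))
    proj = x @ R
    return np.argmax(np.concatenate([proj, -proj], axis=-1), axis=-1)

rng = np.random.default_rng(0)
v = rng.standard_normal(16)
# Nearly identical vectors land in the same bucket with high probability
# (same seed -> same random projections for both calls).
b1 = lsh_buckets(v[None, :], 8, np.random.default_rng(1))
b2 = lsh_buckets((v + 1e-3)[None, :], 8, np.random.default_rng(1))
print(b1, b2)
```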
Are Sixteen Heads Really Better than One?
The surprising observation is made that even when models have been trained using multiple heads, in practice a large percentage of attention heads can be removed at test time without significantly impacting performance.