Corpus ID: 240354799

Skyformer: Remodel Self-Attention with Gaussian Kernel and Nyström Method

Authors: Yifan Chen, Qi Zeng, Heng Ji, Yun Yang
Transformers are expensive to train due to the quadratic time and space complexity in the self-attention mechanism. On the other hand, although kernel machines suffer from the same computation bottleneck in pairwise dot products, several approximation schemes have been successfully incorporated to considerably reduce their computational cost without sacrificing too much accuracy. In this work, we leverage the computation methods for kernel machines to alleviate the high computational cost and… 
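The kernel view sketched in the abstract rests on a standard identity: the unnormalized softmax-attention scores exp(q·k/√d) factor exactly into diagonal rescalings times a Gaussian (RBF) kernel matrix, since q·k = (‖q‖² + ‖k‖² − ‖q − k‖²)/2. A minimal NumPy sketch of that identity on random toy matrices (an illustration of the decomposition, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 8, 4
Q = rng.standard_normal((n, d))
K = rng.standard_normal((n, d))
scale = np.sqrt(d)

# Unnormalized softmax-attention scores: exp(QK^T / sqrt(d)).
scores = np.exp(Q @ K.T / scale)

# The same matrix written as D_Q * G * D_K, where G is a Gaussian
# kernel matrix and D_Q, D_K are diagonal rescalings.
sq_q = np.sum(Q**2, axis=1)                            # ||q_i||^2
sq_k = np.sum(K**2, axis=1)                            # ||k_j||^2
dist2 = sq_q[:, None] + sq_k[None, :] - 2 * Q @ K.T    # ||q_i - k_j||^2
G = np.exp(-dist2 / (2 * scale))                       # Gaussian kernel matrix
scores_via_kernel = (np.exp(sq_q / (2 * scale))[:, None]
                     * G
                     * np.exp(sq_k / (2 * scale))[None, :])

assert np.allclose(scores, scores_via_kernel)
```

Because the middle factor is a positive semi-definite kernel matrix, approximation schemes developed for kernel machines (such as the Nyström method) become applicable to it.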


Sketching as a Tool for Understanding and Accelerating Self-attention for Long Sequences

This work proposes Skeinformer to accelerate self-attention and further improve the accuracy of the matrix approximation to self-attention, using column sampling, adaptive row normalization, and pilot sampling reutilization.

On The Computational Complexity of Self-Attention

It is proved that the time complexity of self-attention is necessarily quadratic in the input length unless the Strong Exponential Time Hypothesis (SETH) is false; this holds even if the attention computation is performed only approximately, and for a variety of attention mechanisms.

The Devil in Linear Transformer

A new linear attention is proposed that replaces the scaling of attention matrices with a normalization to stabilize gradients; it demonstrates superior performance on text classification and language modeling tasks, as well as on the challenging Long-Range Arena benchmark, surpassing the vanilla transformer and existing linear variants by a clear margin.

Linear Complexity Randomized Self-attention Mechanism

A novel perspective is proposed to understand the bias in such approximations by recasting random-feature attentions (RFAs) as self-normalized importance samplers; this view sheds light on an unbiased estimator of the whole softmax attention, called randomized attention (RA).

Empowering parameter-efficient transfer learning by recognizing the kernel structure in self-attention

This paper proposes kernel-wise adapters, namely Kernel-mix, that utilize the kernel structure in self-attention to guide the assignment of the tunable parameters in transformer-based PLMs.

CAB: Comprehensive Attention Benchmarking on Long Sequence Modeling

This paper proposes the Comprehensive Attention Benchmark (CAB), which validates efficient attentions in eight backbone networks to show their generalization across neural architectures, and conducts exhaustive experiments on CAB to benchmark the performance of nine widely used efficient attention architectures designed with different philosophies.

KERPLE: Kernelized Relative Positional Embedding for Length Extrapolation

KERPLE, a framework that generalizes relative positional embedding for length extrapolation by kernelizing positional differences with conditionally positive definite (CPD) kernels, is proposed; it is shown that a CPD kernel can be transformed into a PD kernel by adding a constant offset.

Inducer-tuning: Connecting Prefix-tuning and Adapter-tuning

Through comprehensive empirical experiments on natural language understanding and generation tasks, it is demonstrated that inducer-tuning can close the performance gap between prefix-tuning and fine-tuning.



Why self-attention is Natural for Sequence-to-Sequence Problems? A Perspective from Symmetries

It is shown that orthogonal equivariance in the embedding space is natural for seq2seq functions with knowledge, and that under such equivariance the function must take a form close to self-attention; this shows that network structures similar to self-attention are the right structures to represent the target functions of many seq2seq problems.



Rethinking Attention with Performers

Performers are introduced: Transformer architectures which can estimate regular (softmax) full-rank attention with provable accuracy, using only linear space and time complexity, without relying on any priors such as sparsity or low-rankness.
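The Performer estimate rests on positive random features: with rows of W drawn from N(0, I), the map φ(x) = exp(Wx − ‖x‖²/2)/√m satisfies E[φ(q)·φ(k)] = exp(q·k), the unnormalized softmax kernel. A minimal NumPy sketch on random toy data (a Monte-Carlo illustration of the estimator, not the FAVOR+ implementation, which additionally orthogonalizes the projections):

```python
import numpy as np

def positive_random_features(X, W):
    """Positive random-feature map phi(x) = exp(Wx - ||x||^2 / 2) / sqrt(m)."""
    proj = X @ W.T                                              # (n, m)
    return np.exp(proj - np.sum(X**2, axis=1, keepdims=True) / 2) / np.sqrt(W.shape[0])

rng = np.random.default_rng(1)
n, d, m = 16, 8, 20000          # many random features for a tight estimate
Q = rng.standard_normal((n, d)) * 0.2
K = rng.standard_normal((n, d)) * 0.2
W = rng.standard_normal((m, d))  # Gaussian projection matrix

# Unbiased estimate of the (unnormalized) softmax kernel matrix.
approx = positive_random_features(Q, W) @ positive_random_features(K, W).T
exact = np.exp(Q @ K.T)

print(np.max(np.abs(approx - exact)))  # shrinks as m grows
```

In a Performer the feature maps are applied to queries and keys once, so attention can be computed in linear time via the associativity of matrix products, never materializing the n × n matrix.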

Linformer: Self-Attention with Linear Complexity

This paper demonstrates that the self-attention mechanism of the Transformer can be approximated by a low-rank matrix, and proposes a new self-attention mechanism which reduces the overall self-attention complexity from $O(n^2)$ to $O(n)$ in both time and space.
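Linformer's reduction comes from projecting the keys and values along the sequence axis with learned matrices E and F of shape (k, n), so the attention map is n × k instead of n × n. A minimal NumPy sketch with random projections standing in for the learned ones (toy sizes, not the paper's trained model):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(2)
n, d, k = 1024, 64, 128          # sequence length, head dim, projected length
Q = rng.standard_normal((n, d))
K = rng.standard_normal((n, d))
V = rng.standard_normal((n, d))

# Learned projections E, F in the paper; random here for illustration.
E = rng.standard_normal((k, n)) / np.sqrt(k)
F = rng.standard_normal((k, n)) / np.sqrt(k)

# Project keys/values along the sequence axis: the attention map is
# n x k rather than n x n, giving O(nk) time and memory.
attn = softmax(Q @ (E @ K).T / np.sqrt(d))   # (n, k)
out = attn @ (F @ V)                          # (n, d)
print(out.shape)
```

With k fixed (chosen from the low-rank analysis), cost grows linearly in n.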

Nyströmformer: A Nyström-Based Algorithm for Approximating Self-Attention

This work proposes Nyströmformer - a model that exhibits favorable scalability as a function of sequence length and performs favorably relative to other efficient self-attention methods.
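Nyströmformer approximates the n × n softmax matrix through m landmark queries and keys (segment means), combining three small softmax matrices with a pseudoinverse. A minimal NumPy sketch on random toy data (it follows the paper's matrix structure, but uses an exact pseudoinverse where the paper uses an iterative approximation):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(3)
n, d, m = 512, 64, 32                  # sequence length, head dim, landmarks
Q = rng.standard_normal((n, d))
K = rng.standard_normal((n, d))
V = rng.standard_normal((n, d))
s = np.sqrt(d)

# Landmarks: segment means of queries and keys.
Qt = Q.reshape(m, n // m, d).mean(axis=1)   # (m, d)
Kt = K.reshape(m, n // m, d).mean(axis=1)   # (m, d)

F_mat = softmax(Q @ Kt.T / s)               # (n, m)
A_mat = softmax(Qt @ Kt.T / s)              # (m, m)
B_mat = softmax(Qt @ K.T / s)               # (m, n)

# Nystrom-style approximation of softmax(QK^T / s) @ V at O(nm) cost:
out = F_mat @ (np.linalg.pinv(A_mat) @ (B_mat @ V))   # (n, d)
print(out.shape)
```

No n × n matrix is ever formed; only n × m and m × m pieces appear.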

Revisiting the Nyström Method for Improved Large-scale Machine Learning

An empirical evaluation of the approximation quality and running time of sampling and projection methods on a diverse suite of SPSD matrices is presented, complemented by a suite of worst-case theoretical bounds for both random sampling and random projection methods.

Recursive Sampling for the Nyström Method

We give the first algorithm for kernel Nyström approximation that runs in linear time in the number of training points and is provably accurate for all kernel matrices, without dependence on…

Attention is Not All You Need: Pure Attention Loses Rank Doubly Exponentially with Depth

This work proposes a new way to understand self-attention networks: it is shown that their output can be decomposed into a sum of smaller terms, or paths, each involving the operation of a sequence of attention heads across layers, and it is proved that self-attention possesses a strong inductive bias towards "token uniformity".

Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention

This work expresses the self-attention as a linear dot-product of kernel feature maps and makes use of the associativity property of matrix products to reduce the complexity from $O(N^2)$ to $O(N)$, where $N$ is the sequence length.
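The associativity trick is simple to verify numerically: with a feature map φ, computing (φ(Q)φ(K)ᵀ)V costs O(N²d), while the algebraically identical φ(Q)(φ(K)ᵀV) costs O(Nd²). A minimal NumPy sketch using the φ(x) = elu(x) + 1 feature map from the paper, on random toy data:

```python
import numpy as np

def elu_feature(x):
    # phi(x) = elu(x) + 1: equals x + 1 for x > 0, exp(x) otherwise.
    return np.where(x > 0, x + 1.0, np.exp(x))

rng = np.random.default_rng(4)
n, d = 256, 32
Q = rng.standard_normal((n, d))
K = rng.standard_normal((n, d))
V = rng.standard_normal((n, d))

phi_q, phi_k = elu_feature(Q), elu_feature(K)

# Quadratic order of operations: (phi(Q) phi(K)^T) V  -- O(n^2 d)
num_quad = (phi_q @ phi_k.T) @ V
den_quad = (phi_q @ phi_k.T) @ np.ones(n)

# Linear order of operations:    phi(Q) (phi(K)^T V)  -- O(n d^2)
num_lin = phi_q @ (phi_k.T @ V)
den_lin = phi_q @ phi_k.sum(axis=0)

assert np.allclose(num_quad, num_lin)
assert np.allclose(den_quad, den_lin)

out = num_lin / den_lin[:, None]   # normalized linear attention output
```

In the autoregressive setting the same regrouping turns attention into a recurrence over running sums, which is what makes these transformers behave like RNNs at inference time.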

Generating Long Sequences with Sparse Transformers

This paper introduces sparse factorizations of the attention matrix which reduce this to $O(n \sqrt{n})$, generates unconditional samples that demonstrate global coherence and great diversity, and shows it is possible in principle to use self-attention to model sequences of length one million or more.

Fast Statistical Leverage Score Approximation in Kernel Ridge Regression

A linear time (modulo polylog terms) algorithm is proposed to accurately approximate the statistical leverage scores in the stationary-kernel-based KRR with theoretical guarantees and is orders of magnitude more efficient than existing methods in selecting the representative sub-samples in the Nyström approximation.

Space and Time Efficient Kernel Density Estimation in High Dimensions

This work instantiates the framework with the Laplacian and exponential kernels, two popular kernels which possess the aforementioned property, and presents an improvement to the framework that retains the same query time while requiring only linear space and linear preprocessing time.