Corpus ID: 236318364

FNetAR: Mixing Tokens with Autoregressive Fourier Transforms

Authors: Tim Lou, Michael Park, M. Ramezanali, Vincent Tang
In this note we examine the autoregressive generalization of the FNet algorithm, in which self-attention layers from the standard Transformer architecture are substituted with a trivial sparse-uniform sampling procedure based on Fourier transforms. Using the Wikitext-103 benchmark, we demonstrate that FNetAR retains state-of-the-art performance (25.8 ppl) on the task of causal language modeling compared to a Transformer-XL baseline (24.2 ppl) with only half the number of self-attention layers, thus…
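The causal variant of Fourier token mixing can be illustrated with a toy sketch. The function name `causal_fourier_mix` and the prefix-DFT scheme below are illustrative assumptions, not the paper's exact sparse-uniform sampling procedure; the sketch only shows how a Fourier-based mixer can be made autoregressive, i.e. position t depends on tokens 0..t only.

```python
import numpy as np

def causal_fourier_mix(x):
    """Hypothetical causal token mixing (illustrative, not FNetAR's exact method).

    For each position t, apply a DFT along the sequence axis over the
    prefix x[:t+1] and keep the real part of row t, so the output at t
    never depends on future tokens.
    """
    seq_len = x.shape[0]
    out = np.empty_like(x)
    for t in range(seq_len):
        # Row t of the prefix DFT mixes tokens 0..t only.
        out[t] = np.fft.fft(x[:t + 1], axis=0).real[t]
    return out
```

Because each output row is computed from a prefix, perturbing a future token leaves all earlier outputs unchanged, which is the property a causal language model needs.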

Tables from this paper

References

XLNet: Generalized Autoregressive Pretraining for Language Understanding
XLNet is proposed, a generalized autoregressive pretraining method that enables learning bidirectional contexts by maximizing the expected likelihood over all permutations of the factorization order, and overcomes the limitations of BERT thanks to its autoregressive formulation.
Attention is All you Need
A new simple network architecture, the Transformer, based solely on attention mechanisms and dispensing with recurrence and convolutions entirely, is proposed; it generalizes well to other tasks, applying successfully to English constituency parsing with both large and limited training data.
Controlling Computation versus Quality for Neural Sequence Models
The proposed Conditional Computation Transformer (CCT) is competitive with vanilla Transformers when allowed to utilize its full computational budget, while improving significantly over computationally equivalent baselines when operating on smaller computational budgets.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
Non-local Neural Networks
This paper presents non-local operations as a generic family of building blocks for capturing long-range dependencies in computer vision and improves object detection/segmentation and pose estimation on the COCO suite of tasks.
Transformer-XL: Attentive Language Models beyond a Fixed-Length Context
This work proposes a novel neural architecture Transformer-XL that enables learning dependency beyond a fixed length without disrupting temporal coherence, which consists of a segment-level recurrence mechanism and a novel positional encoding scheme.
Language Models are Unsupervised Multitask Learners
It is demonstrated that language models begin to learn these tasks without any explicit supervision when trained on a new dataset of millions of webpages called WebText, suggesting a promising path towards building language processing systems which learn to perform tasks from their naturally occurring demonstrations.
Improving Language Understanding by Generative Pre-Training
The general task-agnostic model outperforms discriminatively trained models that use architectures specifically crafted for each task, significantly improving upon the state of the art in 9 out of the 12 tasks studied.
FNet: Mixing Tokens with Fourier Transforms
It is found that replacing the self-attention sublayer in a Transformer encoder with a standard, unparameterized Fourier Transform achieves 92% of the accuracy of BERT on the GLUE benchmark, but pre-trains and runs up to seven times faster on GPUs and twice as fast on TPUs.
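The token-mixing sublayer described above is simple to sketch: a 2D discrete Fourier transform over the sequence and hidden dimensions, keeping only the real part. Below is a minimal NumPy sketch of that operation (not the authors' official implementation, which uses framework-specific FFT kernels):

```python
import numpy as np

def fnet_mixing(x):
    """FNet-style token mixing for a (seq_len, hidden_dim) array:
    apply a DFT along both axes and keep the real part.
    Parameter-free, so nothing here is learned."""
    return np.fft.fft2(x).real
```

Because the 1D FFTs along each axis commute, `np.fft.fft2(x)` equals applying `np.fft.fft` along the hidden axis and then along the sequence axis, matching the "FFT over hidden, then FFT over sequence" description in the FNet paper.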
Fourier Image Transformer
It is shown that Fourier Image Transformer (FIT) can be used to solve relevant image analysis tasks in Fourier space, a domain inherently inaccessible to convolutional architectures.