Corpus ID: 240354066

Efficiently Modeling Long Sequences with Structured State Spaces

@article{Gu2022EfficientlyML,
  title={Efficiently Modeling Long Sequences with Structured State Spaces},
  author={Albert Gu and Karan Goel and Christopher R{\'e}},
  journal={ArXiv},
  year={2022},
  volume={abs/2111.00396}
}
A central goal of sequence modeling is designing a single principled model that can address sequence data across a range of modalities and tasks, particularly on long-range dependencies. Although conventional models including RNNs, CNNs, and Transformers have specialized variants for capturing long dependencies, they still struggle to scale to very long sequences of 10,000 or more steps. A promising recent approach proposed modeling sequences by simulating the fundamental state space model (SSM) $x'(t) = Ax(t) + Bu(t)$, $y(t) = Cx(t) + Du(t)$, and showed that for appropriate choices of the state matrix $A$, this system can handle long-range dependencies mathematically and empirically; however, its computation and memory requirements are prohibitive. This paper proposes the Structured State Space sequence model (S4), a new parameterization of the SSM that can be computed much more efficiently while preserving the theoretical strengths of prior approaches.
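For orientation, here is a minimal NumPy sketch of the SSM view the abstract refers to: discretize the continuous SSM, unroll it into a convolution kernel, and apply it as a causal convolution. This is only the naive computation that S4 is designed to avoid (S4 never materializes powers of $A$ explicitly), and the helper names are illustrative, not from the paper.

```python
import numpy as np

def discretize(A, B, step):
    """Bilinear (Tustin) discretization of x'(t) = A x(t) + B u(t).
    Returns (A_bar, B_bar) with x_k = A_bar x_{k-1} + B_bar u_k."""
    I = np.eye(A.shape[0])
    BL = np.linalg.inv(I - (step / 2.0) * A)
    return BL @ (I + (step / 2.0) * A), (BL * step) @ B

def ssm_kernel(A_bar, B_bar, C, L):
    """Unrolled kernel K = (C B_bar, C A_bar B_bar, ..., C A_bar^{L-1} B_bar)."""
    return np.array([(C @ np.linalg.matrix_power(A_bar, l) @ B_bar).item() for l in range(L)])

def causal_conv(u, K):
    """y = K * u as a causal convolution, computed with FFTs."""
    L = len(u)
    return np.fft.irfft(np.fft.rfft(K, n=2 * L) * np.fft.rfft(u, n=2 * L), n=2 * L)[:L]

# Toy usage: a random stable single-input single-output SSM.
N, L = 8, 64
rng = np.random.default_rng(0)
A = rng.normal(size=(N, N)); A = A - A.T - np.eye(N)   # skew-symmetric minus identity: stable
B = rng.normal(size=(N, 1)); C = rng.normal(size=(1, N))
A_bar, B_bar = discretize(A, B, step=1.0 / L)
y = causal_conv(rng.normal(size=L), ssm_kernel(A_bar, B_bar, C, L))
```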

Diagonal State Spaces are as Effective as Structured State Spaces

TLDR
The Diagonal State Space (DSS) model matches the performance of S4 on Long Range Arena tasks and on speech classification with the Speech Commands dataset, while being conceptually simpler and more straightforward to implement.

On the Parameterization and Initialization of Diagonal State Space Models

TLDR
A simple diagonal version of S4 is presented whose kernel computation requires just two lines of code and which performs comparably to S4 in almost all settings, with state-of-the-art results for image, audio, and medical time-series domains and an average of 85% on the Long Range Arena benchmark.
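The short kernel computation mentioned in the TLDR exploits the fact that for a diagonal state matrix the SSM convolution kernel collapses to a Vandermonde-style sum. The sketch below is my own illustration under common conventions in this line of work (zero-order-hold discretization; a factor of 2 from keeping one member of each conjugate eigenvalue pair); it is not code taken from the paper.

```python
import numpy as np

def diagonal_ssm_kernel(Lambda, B, C, step, L):
    """Length-L convolution kernel of a diagonal SSM (illustrative sketch).
    Lambda, B, C: (N,) complex arrays; Lambda is the diagonal of the state matrix."""
    Lambda_bar = np.exp(Lambda * step)                      # zero-order-hold discretization
    B_bar = (Lambda_bar - 1.0) / Lambda * B
    # K[l] = sum_n C[n] * Lambda_bar[n]**l * B_bar[n]: a Vandermonde matrix-vector product.
    powers = Lambda_bar[:, None] ** np.arange(L)[None, :]   # (N, L)
    return 2 * ((C * B_bar) @ powers).real                  # factor 2: conjugate-pair convention
```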

Long Range Language Modeling via Gated State Spaces

TLDR
This work proposes a new layer named Gated State Space (GSS) and shows that it trains significantly faster than the diagonal version of S4 on TPUs, is fairly competitive with several well-tuned Transformer-based baselines and exhibits zero-shot generalization to longer inputs while being straightforward to implement.

Long Movie Clip Classification with State-Space Video Models

TLDR
This work proposes ViS4mer, an efficient long-range video model that combines the strengths of self-attention and the recently introduced structured state-space sequence (S4) layer; it achieves state-of-the-art results in 7 out of 9 long-form movie video classification tasks on the LVU benchmark and generalizes to other domains.

It's Raw! Audio Generation with State-Space Models

TLDR
SaShiMi, a new multi-scale architecture for waveform modeling built around the recently introduced S4 model for long sequence modeling, is proposed, identifying that S4 can be unstable during autoregressive generation, and providing a simple improvement to its parameterization by drawing connections to Hurwitz matrices.

Efficient Long-Text Understanding with Short-Text Models

TLDR
This work proposes SLED, a simple approach for processing long sequences that re-uses and leverages battle-tested short-text pretrained LMs and shows that SLED is competitive with specialized models that are up to 50x larger and require a dedicated and expensive pretraining step.

Character-Level Encoding with S4 for QA

TLDR
A question answering model that encodes text inputs at character level to utilize subword structures and mitigate the out-of-vocabulary problem is proposed, and both S4 and character-level encoding improve the model performance on the question answering task.

ChordMixer: A Scalable Neural Attention Model for Sequences with Different Lengths

TLDR
A simple neural network building block called ChordMixer is proposed that can model attention for long sequences with variable lengths and substantially outperforms other neural attention models.

How to Train Your HiPPO: State Space Models with Generalized Orthogonal Basis Projections

TLDR
A more general and intuitive formulation of the HiPPO framework is derived, which provides a simple mathematical interpretation of S4 as a decomposition onto exponentially-warped Legendre polynomials, explaining its ability to capture long dependencies.
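For context, the HiPPO framework referenced here defines online function approximation as an ODE on the coefficients of an orthogonal polynomial expansion. As I recall it from the HiPPO line of work (stated here as an assumption, not taken from this page), the scaled Legendre (LegS) instance is

\[
\frac{d}{dt}x(t) = -\frac{1}{t}\,A\,x(t) + \frac{1}{t}\,B\,u(t),
\qquad
A_{nk} =
\begin{cases}
\sqrt{(2n+1)(2k+1)} & n > k,\\
n+1 & n = k,\\
0 & n < k,
\end{cases}
\qquad
B_n = \sqrt{2n+1},
\]

and the cited work reinterprets the time-invariant use of this operator in S4 as a projection onto exponentially-warped Legendre polynomials.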

General-purpose, long-context autoregressive modeling with Perceiver AR

TLDR
Perceiver AR is developed, a modality-agnostic architecture which uses cross-attention to map long-range inputs to a small number of latents while also maintaining end-to-end causal masking, enabling practical long-context density estimation without the need for hand-crafted sparsity patterns or memory mechanisms.

References

SHOWING 1-10 OF 60 REFERENCES

Generating Long Sequences with Sparse Transformers

TLDR
This paper introduces sparse factorizations of the attention matrix which reduce the quadratic cost of self-attention to $O(n \sqrt{n})$, generates unconditional samples that demonstrate global coherence and great diversity, and shows it is possible in principle to use self-attention to model sequences of length one million or more.
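As a rough illustration of what a factorized sparse attention pattern looks like (a sketch under my own assumptions, not the paper's exact scheme), the mask below combines a causal local band with a strided pattern, so each query attends to on the order of $\sqrt{n}$ positions per head:

```python
import numpy as np

def strided_sparse_mask(L, stride):
    """Boolean (L, L) causal mask: a local band plus a strided pattern (illustrative)."""
    i = np.arange(L)[:, None]
    j = np.arange(L)[None, :]
    causal = j <= i
    local = (i - j) < stride            # attend to the previous `stride` positions
    strided = ((i - j) % stride) == 0   # plus every stride-th earlier position
    return causal & (local | strided)

mask = strided_sparse_mask(L=16, stride=4)
print(mask.sum(axis=1))  # keys attended per query stays O(sqrt(L)) when stride ~ sqrt(L)
```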

An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling

TLDR
A systematic evaluation of generic convolutional and recurrent architectures for sequence modeling concludes that the common association between sequence modeling and recurrent networks should be reconsidered, and that convolutional networks should be regarded as a natural starting point for sequence modeling tasks.

Learning Longer-term Dependencies in RNNs with Auxiliary Losses

TLDR
This paper proposes a simple method that improves the ability to capture long term dependencies in RNNs by adding an unsupervised auxiliary loss to the original objective, making truncated backpropagation feasible for long sequences and also improving full BPTT.

Dilated Recurrent Neural Networks

TLDR
This paper introduces a simple yet effective RNN connection structure, the DilatedRNN, characterized by multi-resolution dilated recurrent skip connections and introduces a memory capacity measure, the mean recurrent length, which is more suitable for RNNs with long skip connections than existing measures.

It's Raw! Audio Generation with State-Space Models

TLDR
SaShiMi, a new multi-scale architecture for waveform modeling built around the recently introduced S4 model for long sequence modeling, is proposed, identifying that S4 can be unstable during autoregressive generation, and providing a simple improvement to its parameterization by drawing connections to Hurwitz matrices.

Language Modeling with Gated Convolutional Networks

TLDR
A finite-context approach through stacked convolutions, which can be more efficient since they allow parallelization over sequential tokens, is developed; this is the first time a non-recurrent approach is competitive with strong recurrent models on large-scale language tasks.
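The gating in question is the gated linear unit, $h = (X*W + b) \otimes \sigma(X*V + c)$, applied over a causal convolution. Below is a minimal NumPy sketch; the shapes and names are my own illustrative choices, not taken from the paper.

```python
import numpy as np

def glu_conv_layer(X, W, V, b, c):
    """Causal gated convolution: conv(X, W) + b, gated by sigmoid(conv(X, V) + c).
    X: (T, d_in); W, V: (k, d_in, d_out); b, c: (d_out,). Illustrative sketch."""
    k, d_in, d_out = W.shape
    T = X.shape[0]
    Xp = np.concatenate([np.zeros((k - 1, d_in)), X], axis=0)  # left-pad for causality
    out = np.zeros((T, d_out))
    gate = np.zeros((T, d_out))
    for t in range(T):
        window = Xp[t:t + k]                                   # (k, d_in)
        out[t] = np.einsum('ki,kio->o', window, W) + b
        gate[t] = np.einsum('ki,kio->o', window, V) + c
    return out * (1.0 / (1.0 + np.exp(-gate)))                 # elementwise GLU gate
```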

Attention is All you Need

TLDR
A new simple network architecture, the Transformer, based solely on attention mechanisms and dispensing with recurrence and convolutions entirely, is proposed; it generalizes well to other tasks, as demonstrated by applying it successfully to English constituency parsing with both large and limited training data.
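For reference, the core operation is scaled dot-product attention, $\mathrm{softmax}(QK^\top/\sqrt{d_k})\,V$. A minimal single-head NumPy sketch (multi-head projections and masking conventions are simplified):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, causal=False):
    """softmax(Q K^T / sqrt(d_k)) V for one head. Q, K: (L, d_k); V: (L, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    if causal:
        L = scores.shape[0]
        scores = np.where(np.arange(L)[None, :] > np.arange(L)[:, None], -np.inf, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))  # numerically stable softmax
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V
```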

Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting

TLDR
An efficient Transformer-based model for long sequence time-series forecasting (LSTF), named Informer, is proposed, with three distinctive characteristics, including a ProbSparse self-attention mechanism that achieves $O(L \log L)$ time complexity and memory usage while maintaining comparable performance on sequence dependency alignment.

Time-aware Large Kernel Convolutions

TLDR
Time-aware Large Kernel (TaLK) Convolutions is introduced, a novel adaptive convolution operation that learns to predict the size of a summation kernel instead of using a fixed-sized kernel matrix.

Trellis Networks for Sequence Modeling

TLDR
Trellis networks are presented, a new architecture for sequence modeling that outperforms current state-of-the-art methods on a variety of challenging benchmarks, including word-level and character-level language modeling tasks, as well as stress tests designed to evaluate long-term memory retention.
...