# Monarch: Expressive Structured Matrices for Efficient and Accurate Training

@article{Dao2022MonarchES,
title={Monarch: Expressive Structured Matrices for Efficient and Accurate Training},
author={Tri Dao and Beidi Chen and Nimit Sharad Sohoni and Arjun D Desai and Michael Poli and Jessica Grogan and Alexander Liu and Aniruddha Rajendra Rao and Atri Rudra and Christopher R{\'e}},
journal={ArXiv},
year={2022},
volume={abs/2204.00595}
}
• Published 1 April 2022
• Computer Science
• ArXiv
Large neural networks excel in many domains, but they are expensive to train and fine-tune. A popular approach to reduce their compute/memory requirements is to replace dense weight matrices with structured ones (e.g., sparse, low-rank, Fourier transform). These methods have not seen widespread adoption (1) in end-to-end training due to unfavorable efficiency–quality tradeoffs, and (2) in dense-to-sparse fine-tuning due to lack of tractable algorithms to approximate a given dense weight matrix. To…
17 Citations
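A Monarch matrix factors a dense n × n weight (with n = m²) into two block-diagonal matrices interleaved with a fixed "transpose" permutation, cutting the matrix–vector cost from O(n²) to O(n^1.5). A minimal NumPy sketch of such a multiply, with hypothetical function names (not the paper's code):

```python
import numpy as np

def blockdiag_mult(blocks, x):
    """Multiply by a block-diagonal matrix with m blocks of size m x m.

    blocks has shape (m, m, m); x has length n = m * m.
    """
    m = blocks.shape[0]
    return np.einsum("bij,bj->bi", blocks, x.reshape(m, m)).reshape(-1)

def transpose_perm(x, m):
    """Fixed permutation sending index k*m + j to j*m + k (self-inverse)."""
    return x.reshape(m, m).T.reshape(-1)

def monarch_mult(L, R, x, m):
    """Compute y = P L P R x, one common way to write a Monarch product."""
    y = blockdiag_mult(R, x)     # right block-diagonal factor
    y = transpose_perm(y, m)     # permute
    y = blockdiag_mult(L, y)     # left block-diagonal factor
    return transpose_perm(y, m)  # permute back
```

Each block-diagonal multiply touches m blocks of m² entries, so the total work is roughly 2n^1.5 multiply-adds versus n² for a dense layer.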

## Figures and Tables from this paper

(figure and table thumbnails not captured in this extract)

## Citations

• Computer Science
ArXiv
• 2022
A nonsymmetric tridiagonal matrix with off-diagonal sparse entries and offset sub- and super-diagonals is introduced, together with algorithms for its [pseudo]inverse and determinant calculations and for its decomposition into lower triangular matrices.
• Computer Science
ArXiv
• 2022
This work proposes Randomized Sparse Computation, which for the first time demonstrates the potential of training GNNs with approximated operations, and proposes a switching mechanism to improve the generalization of GNNs trained with approximated operations.
• Computer Science
• 2021
The landscape of the sparse matrix approximation problem with a nontrivial family of supports is investigated, proving the absence of spurious local valleys and spurious local minima, whose presence could prevent local optimization methods from achieving global optimality.
• Computer Science
• 2023
This paper describes an approach for accelerating transformer training by learning to grow pretrained transformers, in which a linear map from the parameters of a smaller pretrained model to an initialization of the larger model is learned.
• A. Rudra
• Computer Science
Theory of Computing Systems
• 2022
This survey covers recent work combining arithmetic circuit complexity, structured matrices, and deep learning, which essentially answers the question of whether unstructured weight matrices in neural networks can be replaced by structured ones.
• Computer Science
ArXiv
• 2023
This work proposes Hyena, a subquadratic drop-in replacement for attention constructed by interleaving implicitly parametrized long convolutions and data-controlled gating, and sets a new state-of-the-art for dense-attention-free architectures on language modeling on standard datasets.
• Computer Science
ArXiv
• 2023
This article attempts to summarize some of the most common confusions in SNNs that one may come across in various scenarios, such as paper review/rebuttal and talks, many drawn from the authors' own bittersweet experiences.
• Computer Science
ArXiv
• 2022
First, the *algorithmic speedup* problem is formalized, then the fundamental building blocks of algorithmically efficient training are used to develop a taxonomy, which highlights commonalities of seemingly disparate methods and reveals current research gaps.
• Computer Science
SIAM J. Math. Data Sci.
• 2023
It is proved that any $N \times N$ matrix having the so-called butterfly structure admits an essentially unique factorization into $J$ butterfly factors (where $N = 2^{J}$), and that the factors can be recovered by a hierarchical factorization method, which consists in recursively factorizing the considered matrix into two factors.
• Computer Science
• 2023
It is found that simple interventions, such as squashing the kernel weights, result in smooth kernels and recover SSM performance on a range of tasks including the long range arena, image classification, language modeling, and brain data modeling.
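The butterfly structure appearing in the hierarchical-factorization summary above (and in several references below) writes an N × N matrix with N = 2^J as a product of J sparse factors, each mixing index pairs that differ in a single bit. A minimal sketch of the resulting O(N log N) matrix–vector multiply, with hypothetical helper names:

```python
import numpy as np

def butterfly_mult(factors, x):
    """Apply a product of butterfly factors to x in O(N log N) work.

    factors[l] has shape (N // 2, 2, 2): one 2x2 mixing block per index
    pair (i, i + 2**l) at level l.  Factors are applied in order
    l = 0, 1, ..., J - 1.
    """
    n = x.size
    y = np.asarray(x, dtype=float).copy()
    for level, blocks in enumerate(factors):
        stride = 1 << level
        z = y.copy()
        p = 0
        for start in range(0, n, 2 * stride):
            for i in range(start, start + stride):
                j = i + stride
                top, bot = blocks[p]  # the two rows of the 2x2 block
                z[i] = top[0] * y[i] + top[1] * y[j]
                z[j] = bot[0] * y[i] + bot[1] * y[j]
                p += 1
        y = z
    return y
```

With J = log₂ N levels of N/2 pairs each, the multiply costs O(N log N) versus O(N²) dense; the FFT is the special case where the 2×2 blocks are fixed twiddle factors.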

## References

Showing 1-10 of 119 references

• Computer Science
ICLR
• 2022
The main insight is to optimize over a continuous superset of sparse matrices with a fixed structure known as products of butterfly matrices to relate the generalization bound of sparse models to that of dense models.
• Computer Science
ICLR
• 2020
A family of matrices called kaleidoscope matrices (K-matrices) is introduced that provably captures any structured matrix with near-optimal space (parameter) and time (arithmetic operation) complexity, and that can be automatically learned within end-to-end pipelines to replace hand-crafted procedures.
• Computer Science
ArXiv
• 2021
This paper proposes to approximate a large square matrix with a product of sparse full-rank matrices and uses the parametric method as a scalable attention architecture that performs strongly in learning tasks for long sequential data and outperforms the Transformer and several of its variants.
• Computer Science
NeurIPS
• 2020
This work proposes Top-KAST, a method that preserves constant sparsity throughout training (in both the forward and backward passes), and demonstrates the efficacy of this approach by showing that it performs comparably to or better than previous works when training models on the established ImageNet benchmark, whilst fully maintaining sparsity.
• Computer Science
ICML
• 2019
This work introduces a parameterization of divide-and-conquer methods that can automatically learn an efficient algorithm for many important transforms, and can be incorporated as a lightweight replacement of generic matrices in machine learning pipelines to learn efficient and compressible transformations.
• Computer Science
ICLR
• 2021
This work examines two simple, understudied schemes, spectral initialization and Frobenius decay, for improving the performance of factorized layers on the task of training low-memory residual networks, and shows how both schemes applied to multi-head attention lead to improved performance on both translation and unsupervised pre-training.
• Computer Science
SC20: International Conference for High Performance Computing, Networking, Storage and Analysis
• 2020
This work proposes a tiling-friendly "tile-wise" sparsity pattern, which maintains a regular pattern at the tile level for efficient execution but allows for irregular, arbitrary pruning at the global scale to maintain high accuracy.
• Computer Science
UAI
• 2021
The butterfly architecture used in this work can replace any dense linear operator with a gadget consisting of a sequence of logarithmically many sparse layers, containing a near-linear number of weights in total, with little compromise in expressibility of the resulting operator.
• Computer Science
ArXiv
• 2019
This paper introduces sparse factorizations of the attention matrix which reduce its cost to $O(n \sqrt{n})$, generates unconditional samples that demonstrate global coherence and great diversity, and shows it is possible in principle to use self-attention to model sequences of length one million or more.
• Computer Science
ICLR
• 2018
Across a broad range of neural network architectures, large-sparse models are found to consistently outperform small-dense models and achieve up to 10x reduction in number of non-zero parameters with minimal loss in accuracy.
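Several of the entries above (Top-KAST, the large-sparse vs. small-dense pruning study) rely on magnitude-based sparsity: keep only the k largest-magnitude weights and zero the rest. A minimal sketch of that masking step, with hypothetical function names:

```python
import numpy as np

def topk_mask(W, k):
    """Boolean mask that keeps the k largest-magnitude entries of W."""
    flat = np.abs(W).ravel()
    idx = np.argpartition(flat, -k)[-k:]  # indices of the k largest |w|
    mask = np.zeros(flat.size, dtype=bool)
    mask[idx] = True
    return mask.reshape(W.shape)

def sparse_matvec(W, x, k):
    """Apply only the retained weights, as in magnitude-based sparse training."""
    return (W * topk_mask(W, k)) @ x
```

The mask is recomputed periodically during training in methods like Top-KAST, so weights can re-enter the active set if their magnitudes grow.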