# Monarch: Expressive Structured Matrices for Efficient and Accurate Training

@article{Dao2022MonarchES, title={Monarch: Expressive Structured Matrices for Efficient and Accurate Training}, author={Tri Dao and Beidi Chen and Nimit Sharad Sohoni and Arjun D Desai and Michael Poli and Jessica Grogan and Alexander Liu and Aniruddha Rajendra Rao and Atri Rudra and Christopher R{\'e}}, journal={ArXiv}, year={2022}, volume={abs/2204.00595} }

Large neural networks excel in many domains, but they are expensive to train and fine-tune. A popular approach to reduce their compute/memory requirements is to replace dense weight matrices with structured ones (e.g., sparse, low-rank, Fourier transform). These methods have not seen widespread adoption (1) in end-to-end training due to unfavorable efficiency–quality tradeoffs, and (2) in dense-to-sparse fine-tuning due to lack of tractable algorithms to approximate a given dense weight matrix. To…
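As a rough illustration of the low-rank option named in the abstract (not the Monarch parameterization itself), the sketch below replaces a dense `n x n` weight with two thin factors and compares parameter counts. The width `n`, rank `r`, and the helper `low_rank_linear` are hypothetical choices made for this example.

```python
import numpy as np

def low_rank_linear(x, U, V):
    """Apply the rank-r map W = U @ V to x without ever forming W."""
    return U @ (V @ x)

n, r = 1024, 16  # hypothetical layer width and rank (assumptions)
rng = np.random.default_rng(0)
U = rng.standard_normal((n, r))
V = rng.standard_normal((r, n))
x = rng.standard_normal(n)

dense_params = n * n         # parameters in a dense n x n weight
low_rank_params = 2 * n * r  # parameters in the two thin factors

y = low_rank_linear(x, U, V)
print(dense_params, low_rank_params)
```

Applying the factors one after the other costs on the order of `2 * n * r` multiply-adds instead of `n * n`, which is where the compute saving comes from; whether quality holds up at a given rank is the tradeoff the abstract refers to.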

## 17 Citations

### A Structured Sparse Neural Network and Its Matrix Calculations Algorithm

- Computer Science, ArXiv
- 2022

A nonsymmetric tridiagonal matrix with off-diagonal sparse entries and offset sub- and super-diagonals is introduced, along with algorithms for its [pseudo]inverse and determinant calculations and a decomposition for lower triangular matrices.

### RSC: Accelerating Graph Neural Networks Training via Randomized Sparse Computations

- Computer Science, ArXiv
- 2022

This work proposes Randomized Sparse Computation (RSC), which demonstrates for the first time the potential of training GNNs with approximated operations, and proposes a switching mechanism to improve the generalization of GNNs trained with approximated operations.

### Spurious Valleys, NP-hardness, and Tractability of Sparse Matrix Factorization With Fixed Support

- Computer Science
- 2021

The landscape of the sparse matrix approximation problem is investigated for a nontrivial family of supports, proving the absence of spurious local valleys and spurious local minima, whose presence could prevent local optimization methods from achieving global optimality.

### Learning to Grow Pretrained Models for Efficient Transformer Training

- Computer Science
- 2023

This paper describes an approach for accelerating transformer training by learning to grow pretrained transformers: a linear map from the parameters of a smaller model is learned and used to initialize the larger model.

### Arithmetic Circuits, Structured Matrices and (not so) Deep Learning

- Computer Science, Theory of Computing Systems
- 2022

This survey shows how a recent work that combines arithmetic circuit complexity, structured matrices and deep learning essentially answers the research question of replacing unstructured weight matrices in neural networks with structured ones.

### Hyena Hierarchy: Towards Larger Convolutional Language Models

- Computer Science, ArXiv
- 2023

This work proposes Hyena, a subquadratic drop-in replacement for attention constructed by interleaving implicitly parametrized long convolutions and data-controlled gating, and sets a new state-of-the-art for dense-attention-free architectures on language modeling in standard datasets.

### Ten Lessons We Have Learned in the New "Sparseland": A Short Handbook for Sparse Neural Network Researchers

- Computer Science, ArXiv
- 2023

This article attempts to summarize some of the most common confusions in SNNs that one may come across in scenarios such as paper review/rebuttal and talks, many drawn from the authors' own bittersweet experiences.

### Compute-Efficient Deep Learning: Algorithmic Trends and Opportunities

- Computer Science, ArXiv
- 2022

First, the *algorithmic speedup* problem is formalized, then the fundamental building blocks of algorithmically efficient training are used to develop a taxonomy, which highlights commonalities of seemingly disparate methods and reveals current research gaps.

### Efficient Identification of Butterfly Sparse Matrix Factorizations

- Computer Science, SIAM J. Math. Data Sci.
- 2023

It is proved that any $N \times N$ matrix having the so-called butterfly structure admits an essentially unique factorization into $J$ butterfly factors (where $N = 2^{J}$), and that the factors can be recovered by a hierarchical factorization method, which consists in recursively factorizing the considered matrix into two factors.
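To make the butterfly structure concrete, here is a minimal sketch using the Hadamard matrix, a standard example of a matrix with this structure. It builds $J$ sparse factors by hand and checks that their product is the dense $N \times N$ Hadamard matrix with $N = 2^{J}$. It does not implement the hierarchical recovery method summarized above; the function names are hypothetical and chosen for this example.

```python
from functools import reduce

import numpy as np

H2 = np.array([[1.0, 1.0], [1.0, -1.0]])  # 2 x 2 Hadamard block

def hadamard_butterfly_factors(J):
    """Return J sparse factors whose product is the 2^J x 2^J Hadamard matrix."""
    N = 2 ** J
    factors = []
    for level in range(J):
        left = np.eye(2 ** level)               # identity on coarser blocks
        right = np.eye(N // 2 ** (level + 1))   # identity on finer blocks
        factors.append(np.kron(np.kron(left, H2), right))
    return factors

def hadamard_dense(J):
    """Build the same matrix directly as a J-fold Kronecker power of H2."""
    H = np.array([[1.0]])
    for _ in range(J):
        H = np.kron(H, H2)
    return H

J = 4
N = 2 ** J
factors = hadamard_butterfly_factors(J)
product = reduce(np.matmul, factors)

# Each factor has exactly two nonzeros per row.
total_nonzeros = sum(int(np.count_nonzero(F)) for F in factors)
print(np.allclose(product, hadamard_dense(J)), total_nonzeros, N * N)
```

Each factor stores $2N$ nonzeros, so the factorized form holds $2N \log_2 N$ values in total instead of $N^2$ for the dense matrix.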

### Simple Hardware-Efficient Long Convolutions for Sequence Modeling

- Computer Science
- 2023

It is found that simple interventions--such as squashing the kernel weights--result in smooth kernels and recover SSM performance on a range of tasks including the long range arena, image classification, language modeling, and brain data modeling.

## References

### Pixelated Butterfly: Simple and Efficient Sparse training for Neural Network Models

- Computer Science, ICLR
- 2022

The main insight is to optimize over a continuous superset of sparse matrices with a fixed structure known as products of butterfly matrices to relate the generalization bound of sparse models to that of dense models.

### Kaleidoscope: An Efficient, Learnable Representation For All Structured Linear Maps

- Computer Science, ICLR
- 2020

A family of matrices called kaleidoscope matrices (K-matrices) is introduced that provably captures any structured matrix with near-optimal space (parameter) and time (arithmetic operation) complexity, and that can be automatically learned within end-to-end pipelines to replace hand-crafted procedures.

### Sparse Factorization of Large Square Matrices

- Computer Science, ArXiv
- 2021

This paper proposes to approximate a large square matrix with a product of sparse full-rank matrices and uses the parametric method as a scalable attention architecture that performs strongly in learning tasks for long sequential data and defeats Transformer and its several variants.

### Top-KAST: Top-K Always Sparse Training

- Computer Science, NeurIPS
- 2020

This work proposes Top-KAST, a method that preserves constant sparsity throughout training (in both the forward and backward-passes), and demonstrates the efficacy of this approach by showing that it performs comparably to or better than previous works when training models on the established ImageNet benchmark, whilst fully maintaining sparsity.

### Learning Fast Algorithms for Linear Transforms Using Butterfly Factorizations

- Computer Science, ICML
- 2019

This work introduces a parameterization of divide-and-conquer methods that can automatically learn an efficient algorithm for many important transforms, and can be incorporated as a lightweight replacement of generic matrices in machine learning pipelines to learn efficient and compressible transformations.

### Initialization and Regularization of Factorized Neural Layers

- Computer Science, ICLR
- 2021

This work examines two simple, understudied schemes, spectral initialization and Frobenius decay, for improving the performance of factorized layers on the task of training low-memory residual networks, and shows how both schemes applied to multi-head attention lead to improved performance on both translation and unsupervised pre-training.

### Accelerating Sparse DNN Models without Hardware-Support via Tile-Wise Sparsity

- Computer Science, SC20: International Conference for High Performance Computing, Networking, Storage and Analysis
- 2020

This work proposes a tiling-friendly “tile-wise” sparsity pattern, which maintains a regular pattern at the tile level for efficient execution but allows for irregular, arbitrary pruning at the global scale to maintain the high accuracy.

### Sparse Linear Networks with a Fixed Butterfly Structure: Theory and Practice

- Computer Science, UAI
- 2021

The butterfly architecture used in this work can replace any dense linear operator with a gadget consisting of a sequence of logarithmically many sparse layers, containing a near-linear total number of weights, with little compromise in the expressibility of the resulting operator.

### Generating Long Sequences with Sparse Transformers

- Computer Science, ArXiv
- 2019

This paper introduces sparse factorizations of the attention matrix which reduce the quadratic time and memory cost of attention in the sequence length $n$ to $O(n \sqrt{n})$, generates unconditional samples that demonstrate global coherence and great diversity, and shows it is possible in principle to use self-attention to model sequences of length one million or more.

### To prune, or not to prune: exploring the efficacy of pruning for model compression

- Computer Science, ICLR
- 2018

Across a broad range of neural network architectures, large-sparse models are found to consistently outperform small-dense models and achieve up to 10x reduction in number of non-zero parameters with minimal loss in accuracy.