Monarch: Expressive Structured Matrices for Efficient and Accurate Training

  • Tri Dao, Beidi Chen, Nimit Sharad Sohoni, Arjun D. Desai, Michael Poli, Jessica Grogan, Alexander Liu, Aniruddha Rajendra Rao, Atri Rudra, Christopher Ré
Large neural networks excel in many domains, but they are expensive to train and fine-tune. A popular approach to reduce their compute/memory requirements is to replace dense weight matrices with structured ones (e.g., sparse, low-rank, Fourier transform). These methods have not seen widespread adoption (1) in end-to-end training due to unfavorable efficiency–quality tradeoffs, and (2) in dense-to-sparse fine-tuning due to lack of tractable algorithms to approximate a given dense weight matrix. To… 
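The Monarch structure itself is easy to sketch. In one common parametrization, a Monarch matrix over N = m² dimensions is M = Pᵀ R P L, where L and R are block-diagonal and P is the reshape-transpose permutation, so a matvec costs O(N^1.5) instead of O(N²). The sketch below is a minimal NumPy illustration under that parametrization; the function name, block shapes, and permutation convention are assumptions for the example, not the paper's reference implementation:

```python
import numpy as np

def monarch_matvec(L_blocks, R_blocks, x):
    """Multiply a Monarch-structured matrix by x, assuming N = m * m.

    Illustrative parametrization: M = P^T R P L, where L and R are
    block-diagonal with m blocks of size m x m, and P is the permutation
    that transposes an m x m reshaping of the vector.
    """
    m = L_blocks.shape[0]            # L_blocks, R_blocks: shape (m, m, m)
    # Apply L: view x as m chunks of length m, each hit by its own block.
    z = np.einsum('bij,bj->bi', L_blocks, x.reshape(m, m))
    # Permutation P: transpose the m x m grid.
    z = z.T
    # Apply R block-diagonally, then undo the permutation.
    y = np.einsum('bij,bj->bi', R_blocks, z)
    return y.T.reshape(-1)
```

Each block multiply touches only m² entries per block, so the total work is 2m·m² = O(N^1.5) for N = m².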

A Structured Sparse Neural Network and Its Matrix Calculations Algorithm

A nonsymmetric tridiagonal matrix with sparse off-diagonal entries and offset sub- and super-diagonals is introduced, together with algorithms for its (pseudo)inverse and determinant calculations and for its lower-triangular decomposition.

RSC: Accelerating Graph Neural Networks Training via Randomized Sparse Computations

This work proposes Randomized Sparse Computation (RSC), which for the first time demonstrates the potential of training GNNs with approximated operations, and proposes a switching mechanism to improve the generalization of GNNs trained with approximate operations.

Spurious Valleys, NP-hardness, and Tractability of Sparse Matrix Factorization With Fixed Support

The landscape of the sparse matrix approximation problem with a nontrivial family of fixed supports is investigated, proving the absence of spurious local valleys and spurious local minima, whose presence could prevent local optimization methods from achieving global optimality.

Learning to Grow Pretrained Models for Efficient Transformer Training

This paper describes an approach for accelerating transformer training by learning to grow pretrained transformers: a linear map from the parameters of the smaller model is learned and used to initialize the larger model.

Arithmetic Circuits, Structured Matrices and (not so) Deep Learning

  • A. Rudra
  • Computer Science
    Theory of Computing Systems
  • 2022
This survey covers recent work that combines arithmetic circuit complexity, structured matrices, and deep learning, and essentially answers the research question of whether unstructured weight matrices in neural networks can be replaced by structured ones.

Hyena Hierarchy: Towards Larger Convolutional Language Models

This work proposes Hyena, a subquadratic drop-in replacement for attention constructed by interleaving implicitly parametrized long convolutions and data-controlled gating, and sets a new state of the art for dense-attention-free architectures on language modeling on standard datasets.

Ten Lessons We Have Learned in the New "Sparseland": A Short Handbook for Sparse Neural Network Researchers

This article attempts to summarize some of the most common confusions about SNNs that one may come across in various scenarios, such as paper reviews/rebuttals and talks - many drawn from the authors' own bittersweet experiences!

Compute-Efficient Deep Learning: Algorithmic Trends and Opportunities

First, the *algorithmic speedup* problem is formalized, then the fundamental building blocks of algorithmically efficient training are used to develop a taxonomy, which highlights commonalities of seemingly disparate methods and reveals current research gaps.

Efficient Identification of Butterfly Sparse Matrix Factorizations

It is proved that any $N \times N$ matrix having the so-called butterfly structure admits an essentially unique factorization into $J$ butterfly factors (where $N = 2^{J}$), and that the factors can be recovered by a hierarchical factorization method, which consists of recursively factorizing the considered matrix into two factors.
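A concrete instance of this structure: the $N \times N$ Hadamard transform with $N = 2^J$ factors into exactly $J$ butterfly factors, each with two nonzeros per row and column. The sketch below builds those factors from Kronecker products and uses nothing beyond the definition; the function name and factor ordering are illustrative choices, not taken from the paper:

```python
import numpy as np

def hadamard_butterfly_factors(J):
    """J butterfly factors whose product is the N x N Hadamard matrix, N = 2^J.

    Factor k is I_{2^(k-1)} (kron) H_2 (kron) I_{2^(J-k)}: every row and
    every column has exactly 2 nonzeros, the defining butterfly sparsity.
    """
    H2 = np.array([[1.0, 1.0], [1.0, -1.0]])
    N = 2 ** J
    return [np.kron(np.kron(np.eye(2 ** (k - 1)), H2), np.eye(N // 2 ** k))
            for k in range(1, J + 1)]

J = 3
factors = hadamard_butterfly_factors(J)
H = np.linalg.multi_dot(factors)   # dense product of the J sparse factors
```

Applying the factors one at a time costs $O(N \log N)$, which is exactly how the fast Walsh–Hadamard transform arises from the butterfly structure.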

Simple Hardware-Efficient Long Convolutions for Sequence Modeling

It is found that simple interventions, such as squashing the kernel weights, result in smooth kernels and recover SSM performance on a range of tasks including the Long Range Arena, image classification, language modeling, and brain data modeling.



Pixelated Butterfly: Simple and Efficient Sparse training for Neural Network Models

The main insight is to optimize over a continuous superset of sparse matrices with a fixed structure, namely products of butterfly matrices, and to relate the generalization bound of sparse models to that of dense models.

Kaleidoscope: An Efficient, Learnable Representation For All Structured Linear Maps

A family of matrices called kaleidoscope matrices (K-matrices) is introduced that provably captures any structured matrix with near-optimal space (parameter) and time (arithmetic operation) complexity, and that can be automatically learned within end-to-end pipelines to replace hand-crafted procedures.

Sparse Factorization of Large Square Matrices

This paper proposes to approximate a large square matrix with a product of sparse full-rank matrices, and uses the parametric method as a scalable attention architecture that performs strongly on learning tasks for long sequential data, outperforming the Transformer and several of its variants.

Top-KAST: Top-K Always Sparse Training

This work proposes Top-KAST, a method that preserves constant sparsity throughout training (in both the forward and backward passes), and demonstrates the efficacy of this approach by showing that it performs comparably to or better than previous works when training models on the established ImageNet benchmark, whilst fully maintaining sparsity.
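The core operation here is a per-tensor top-k magnitude mask; in Top-KAST the backward pass uses a mask at a somewhat higher density than the forward pass, so gradients reach currently-pruned weights and the active set can change over training. A minimal NumPy sketch (the function name, tie-breaking behavior, and the extra backward density are illustrative assumptions):

```python
import numpy as np

def topk_mask(w, density):
    """Boolean mask keeping the top-k entries of |w|, k = density * w.size.

    Ties at the threshold may keep slightly more than k entries; a real
    implementation would break ties explicitly.
    """
    k = max(1, int(round(density * w.size)))
    thresh = np.partition(np.abs(w).ravel(), -k)[-k]
    return np.abs(w) >= thresh

w = np.array([[0.1, -2.0, 0.3], [1.5, -0.2, 0.05]])
fwd = topk_mask(w, 0.5)         # forward: 50% of weights active
bwd = topk_mask(w, 0.5 + 0.2)   # backward: an extra 20% receive gradients
```

The forward mask is always a subset of the backward mask, which is what lets pruned-but-promising weights grow back into the active set.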

Learning Fast Algorithms for Linear Transforms Using Butterfly Factorizations

This work introduces a parameterization of divide-and-conquer methods that can automatically learn an efficient algorithm for many important transforms, and can be incorporated as a lightweight replacement of generic matrices in machine learning pipelines to learn efficient and compressible transformations.

Initialization and Regularization of Factorized Neural Layers

This work examines two simple, understudied schemes, spectral initialization and Frobenius decay, for improving the performance of factorized neural layers on the task of training low-memory residual networks, and shows how both schemes applied to multi-head attention lead to improved performance on both translation and unsupervised pre-training.

Accelerating Sparse DNN Models without Hardware-Support via Tile-Wise Sparsity

  • Cong Guo, Bo Yang Hsueh, Yuhao Zhu
  • Computer Science
    SC20: International Conference for High Performance Computing, Networking, Storage and Analysis
  • 2020
This work proposes a tiling-friendly "tile-wise" sparsity pattern, which maintains a regular pattern at the tile level for efficient execution but allows for irregular, arbitrary pruning at the global scale to maintain high accuracy.
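As a rough illustration of the idea (not the paper's kernel or its exact pruning criterion; the tile size, the column-norm criterion, and `keep_cols` are assumptions for this sketch), tile-wise pruning keeps each tile regular while letting the global pattern stay irregular:

```python
import numpy as np

def tile_wise_prune(W, tile=4, keep_cols=2):
    """Sketch of tile-wise sparsity: within each (tile x tile) tile, keep
    only the `keep_cols` columns with the largest L2 norm, zero the rest.

    Each tile stays regular (a dense tile with some columns removed), which
    tiled GEMM kernels can exploit, while different tiles may prune
    different columns, so the global pattern is irregular. Assumes the
    matrix dimensions are divisible by `tile`.
    """
    rows, cols = W.shape
    out = np.zeros_like(W)
    for i in range(0, rows, tile):
        for j in range(0, cols, tile):
            blk = W[i:i+tile, j:j+tile]
            norms = np.linalg.norm(blk, axis=0)
            keep = np.argsort(norms)[-keep_cols:]
            out[i:i+tile, j:j+tile][:, keep] = blk[:, keep]
    return out
```

With `tile=4, keep_cols=2` every tile ends up 50% column-sparse, but the surviving columns differ from tile to tile.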

Sparse Linear Networks with a Fixed Butterfly Structure: Theory and Practice

The butterfly architecture used in this work can replace any dense linear operator with a gadget consisting of a sequence of logarithmically many sparse layers containing a near-linear total number of weights, with little compromise in the expressiveness of the resulting operator.

Generating Long Sequences with Sparse Transformers

This paper introduces sparse factorizations of the attention matrix which reduce this to $O(n \sqrt{n})$, generates unconditional samples that demonstrate global coherence and great diversity, and shows it is possible in principle to use self-attention to model sequences of length one million or more.
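One of the factorized patterns described in the paper, the "strided" pattern, can be sketched as a boolean attention mask. This is a simplified single-head illustration; the helper name and the exact union of local and periodic terms follow the paper's description only loosely:

```python
import numpy as np

def strided_attention_mask(n, stride):
    """Boolean mask for a strided sparse attention pattern (a sketch).

    Position i attends to (a) the previous `stride` positions and (b) every
    earlier position at a distance that is a multiple of `stride`, always
    restricted to j <= i (causal). With stride ~ sqrt(n), each row has
    O(sqrt(n)) nonzeros, for O(n * sqrt(n)) total work instead of O(n^2).
    """
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    causal = j <= i
    local = (i - j) < stride
    periodic = (i - j) % stride == 0
    return causal & (local | periodic)
```

The two terms compose: the local band carries fine-grained context, while the periodic "summary" positions propagate information across the whole sequence in two attention steps.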

To prune, or not to prune: exploring the efficacy of pruning for model compression

Across a broad range of neural network architectures, large-sparse models are found to consistently outperform small-dense models and achieve up to 10x reduction in number of non-zero parameters with minimal loss in accuracy.
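The paper's automated gradual-pruning schedule raises sparsity from an initial value $s_i$ to a final value $s_f$ with a cubic ramp, pruning aggressively early (when weights are most redundant) and slowly later. A small sketch; the default argument values are illustrative, not the paper's:

```python
def sparsity_schedule(t, s_i=0.0, s_f=0.9, t0=0, n=100, dt=1):
    """Cubic gradual-pruning schedule:
    s_t = s_f + (s_i - s_f) * (1 - (t - t0) / (n * dt))**3,
    with t clipped to the pruning window [t0, t0 + n * dt].

    Sparsity rises quickly at first, then levels off as it approaches s_f.
    """
    frac = min(max((t - t0) / (n * dt), 0.0), 1.0)
    return s_f + (s_i - s_f) * (1.0 - frac) ** 3
```

At each pruning step the schedule's value is realized by magnitude pruning: zero out the smallest-magnitude weights until the current target sparsity is met.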