• Corpus ID: 244773609

# Pixelated Butterfly: Simple and Efficient Sparse Training for Neural Network Models

@article{Chen2021PixelatedBS,
  title={Pixelated Butterfly: Simple and Efficient Sparse Training for Neural Network Models},
  author={Beidi Chen and Tri Dao and Kaizhao Liang and Jiaming Yang and Zhao Song and Atri Rudra and Christopher R{\'e}},
  journal={ArXiv},
  year={2021},
  volume={abs/2112.00029}
}
• Published 30 November 2021
• Computer Science
• ArXiv
Overparameterized neural networks generalize well but are expensive to train. Ideally, one would like to reduce their computational cost while retaining their generalization benefits. Sparse model training is a simple and promising approach to achieve this, but there remain challenges as existing methods struggle with accuracy loss, slow training runtime, or difficulty in sparsifying all model components. The core problem is that searching for a sparsity mask over a discrete set of sparse matrices…
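To make the fixed-mask idea concrete, here is a minimal NumPy sketch of a first-order butterfly-style support; the helper name `flat_butterfly_mask` and the exact construction are illustrative, not the paper's code. Entry (i, j) survives when the row and column indices differ in at most one bit, which is the union of the supports of the log2(n) butterfly factors:

```python
import numpy as np

def flat_butterfly_mask(n):
    """Support of a first-order flat-butterfly pattern on an n x n weight
    matrix (n a power of 2): position (i, j) is kept when i == j or when
    i and j differ in exactly one bit, i.e. the union of the supports of
    the log2(n) butterfly factors."""
    assert n > 0 and n & (n - 1) == 0, "n must be a power of 2"
    idx = np.arange(n)
    x = idx[:, None] ^ idx[None, :]      # bitwise XOR of row and column index
    return (x & (x - 1)) == 0            # XOR is zero or a power of two

mask = flat_butterfly_mask(8)
print(mask.sum(), mask.shape)  # 32 (8, 8)
```

Because each row keeps only 1 + log2(n) entries, the mask's density shrinks as n grows, which is what makes a fixed butterfly-derived support attractive compared with searching over a discrete set of masks.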

## 2 Citations

Monarch: Expressive Structured Matrices for Efficient and Accurate Training
• Computer Science
• ArXiv
• 2022
Surprisingly, the problem of approximating a dense weight matrix with a Monarch matrix, though nonconvex, has an analytical optimal solution, unlocking new ways to train and fine-tune sparse and dense models.
A Dynamic Fast Gaussian Transform
• Computer Science
• ArXiv
• 2022
The main result is an efficient dynamic FGT algorithm supporting the following operations in log(n/ε) time: adding or deleting a source point, and estimating the "kernel density" of a query point with respect to the sources with ε additive accuracy.

## References

Showing 1-10 of 106 references
Kaleidoscope: An Efficient, Learnable Representation For All Structured Linear Maps
• Computer Science
• ICLR
• 2020
A family of matrices called kaleidoscope matrices (K-matrices) is introduced that provably captures any structured matrix with near-optimal space (parameter) and time (arithmetic operation) complexity, and that can be learned automatically within end-to-end pipelines to replace hand-crafted procedures.
ReSprop: Reuse Sparsified Backpropagation
• Computer Science
• 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
• 2020
This work proposes a new algorithm, Reuse-Sparse-Backprop (ReSprop), as a method to sparsify gradient vectors during CNN training, and introduces a generic sparse convolution neural network accelerator (GSCN), which is designed to accelerate sparse back-propagation convolutions.
SNIP: Single-shot Network Pruning based on Connection Sensitivity
• Computer Science
• ICLR
• 2019
This work presents a new approach that prunes a given network once at initialization prior to training, and introduces a saliency criterion based on connection sensitivity that identifies structurally important connections in the network for the given task.
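The connection-sensitivity idea can be sketched in a few lines of NumPy for a toy linear layer whose loss gradient has a closed form; the layer, shapes, and the value of k here are illustrative, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear layer y = W x with loss L = 0.5 * ||W x - t||^2; the gradient
# of L w.r.t. W has the closed form (W x - t) x^T, so no autograd is needed.
W = rng.standard_normal((4, 6))
x = rng.standard_normal(6)
t = rng.standard_normal(4)

grad = np.outer(W @ x - t, x)

# SNIP-style connection sensitivity: saliency of each weight is |g * w|,
# normalized to sum to one; keep only the top-k connections at initialization.
saliency = np.abs(grad * W)
saliency /= saliency.sum()

k = 6                                    # keep 6 of the 24 connections
keep = np.zeros(W.size, dtype=bool)
keep[np.argsort(saliency.ravel())[-k:]] = True
mask = keep.reshape(W.shape)
print(mask.sum())  # 6
```

The point of the criterion is that the mask is computed from a single minibatch before training, so no train-prune-retrain cycle is needed.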
Pruning Filters for Efficient ConvNets
• Computer Science
• ICLR
• 2017
This work presents an acceleration method for CNNs, where it is shown that even simple filter pruning techniques can reduce inference costs for VGG-16 and ResNet-110 by up to 38% on CIFAR10 while regaining close to the original accuracy by retraining the networks.
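A minimal NumPy sketch of the kind of simple filter pruning described here: score each filter by the L1 norm of its weights and drop the lowest-scoring fraction. The filter shapes and the 25% ratio are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

# 16 conv filters, each of shape (in_channels=3, 3, 3). Score each filter
# by the L1 norm of its weights and drop the lowest-scoring quarter.
filters = rng.standard_normal((16, 3, 3, 3))
scores = np.abs(filters).reshape(len(filters), -1).sum(axis=1)

prune_ratio = 0.25
n_drop = int(len(filters) * prune_ratio)
keep_idx = np.sort(np.argsort(scores)[n_drop:])  # surviving filter indices
pruned = filters[keep_idx]
print(pruned.shape)  # (12, 3, 3, 3)
```

Because whole filters (and the corresponding output channels) are removed, the pruned layer stays dense and needs no sparse kernels, which is why this kind of structured pruning translates directly into inference speedups.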
Learning both Weights and Connections for Efficient Neural Network
• Computer Science
• NIPS
• 2015
A method to reduce the storage and computation required by neural networks by an order of magnitude without affecting their accuracy, by learning only the important connections and pruning redundant ones with a three-step process.
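The pruning step of the three-step method (train, prune low-magnitude connections, retrain) can be sketched as follows; the training steps are omitted, and the layer shape and 90% sparsity level are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)

# Pruning step only: remove connections whose magnitude falls below the
# 90th-percentile threshold. In the full method this is preceded by
# training and followed by retraining the surviving weights.
W = rng.standard_normal((8, 8))

sparsity = 0.9
threshold = np.quantile(np.abs(W), sparsity)
mask = np.abs(W) >= threshold
W_pruned = W * mask
print(mask.sum(), "of", W.size, "weights kept")  # 7 of 64 weights kept
```

Unlike filter pruning, this produces an unstructured sparsity pattern, so the storage savings are immediate but fast execution requires sparse-matrix kernels.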
A Convergence Theory for Deep Learning via Over-Parameterization
• Computer Science
• ICML
• 2019
This work proves why stochastic gradient descent can find global minima on the training objective of DNNs in *polynomial time*, and implies an equivalence between over-parameterized neural networks and the neural tangent kernel (NTK) in the finite (and polynomial) width setting.
Fine-Grained Analysis of Optimization and Generalization for Overparameterized Two-Layer Neural Networks
• Computer Science
• ICML
• 2019
This paper analyzes training and generalization for a simple two-layer ReLU network with random initialization, and provides the following improvements over recent works: a tighter characterization of training speed, an explanation for why training a neural network with random labels leads to slower training, and a data-dependent complexity measure.
Generating Long Sequences with Sparse Transformers
• Computer Science
• ArXiv
• 2019
This paper introduces sparse factorizations of the attention matrix which reduce the quadratic cost of self-attention to $O(n \sqrt{n})$, generates unconditional samples that demonstrate global coherence and great diversity, and shows it is possible in principle to use self-attention to model sequences of length one million or more.
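One member of this family of sparse factorizations, a causal strided pattern, can be sketched in NumPy; the helper name and exact window rule are illustrative. With stride ≈ √n, each query attends to O(√n) keys:

```python
import numpy as np

def strided_attention_mask(n, stride):
    """Causal strided sparsity pattern: query i may attend to key j when j
    lies in the local window of `stride` previous positions, or when
    (i - j) is a multiple of `stride` (a 'summary' position)."""
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    causal = j <= i
    local = (i - j) < stride
    summary = (i - j) % stride == 0
    return causal & (local | summary)

mask = strided_attention_mask(16, 4)
print(np.nonzero(mask[10])[0])  # keys attended by query 10
```

Applying such a mask before the softmax zeroes out most of the n × n score matrix, which is how the quadratic cost of dense attention is avoided.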
Butterfly Transform: An Efficient FFT Based Neural Architecture Design
• Computer Science
• 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
• 2020
It is shown that extending the butterfly operations from the FFT algorithm to a general Butterfly Transform (BFT) can be beneficial in building an efficient block structure for CNN designs, and that ShuffleNet-V2+BFT outperforms state-of-the-art architecture search methods MnasNet, FBNet and MobileNetV3 in the low-FLOP regime.
Scatterbrain: Unifying Sparse and Low-rank Attention Approximation
• Computer Science
• ArXiv
• 2021
This work proposes Scatterbrain, a novel way to unify sparse (via locality sensitive hashing) and low-rank (via kernel feature map) attention for accurate and efficient approximation, and shows that it can achieve 2.1× lower error than baselines when serving as a drop-in replacement in BigGAN image generation and pre-trained T2T-ViT.