• Corpus ID: 244773609

Pixelated Butterfly: Simple and Efficient Sparse training for Neural Network Models

Beidi Chen, Tri Dao, Kaizhao Liang, Jiaming Yang, Zhao Song, Atri Rudra, Christopher Ré
Overparameterized neural networks generalize well but are expensive to train. Ideally, one would like to reduce their computational cost while retaining their generalization benefits. Sparse model training is a simple and promising approach to achieve this, but there remain challenges as existing methods struggle with accuracy loss, slow training runtime, or difficulty in sparsifying all model components. The core problem is that searching for a sparsity mask over a discrete set of sparse matrices… 
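The butterfly sparsity pattern the paper builds on is easy to illustrate: each FFT-style factor has only two nonzeros per row, yet the product of all log₂(n) factors has fully dense support. A minimal numpy sketch (function names are illustrative, not from the paper's code):

```python
import numpy as np

def butterfly_factor_mask(n: int, k: int) -> np.ndarray:
    """Support of the k-th butterfly factor: row i has nonzeros at
    column i and at column i with bit k flipped (as in the FFT)."""
    mask = np.zeros((n, n), dtype=int)
    for i in range(n):
        mask[i, i] = 1
        mask[i, i ^ (1 << k)] = 1
    return mask

def butterfly_product_support(n: int) -> np.ndarray:
    """Support of the product of all log2(n) butterfly factors."""
    support = np.eye(n, dtype=int)
    for k in range(int(np.log2(n))):
        support = (support @ butterfly_factor_mask(n, k) > 0).astype(int)
    return support
```

Each factor stores 2n values, so the full product costs 2n·log₂(n) parameters instead of n² while retaining dense support — which is why a fixed butterfly mask sidesteps the discrete mask search the abstract describes.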
Monarch: Expressive Structured Matrices for Efficient and Accurate Training
Surprisingly, the problem of approximating a dense weight matrix with a Monarch matrix, though nonconvex, has an analytical optimal solution, unlocking new ways to train and fine-tune sparse and dense models.
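A minimal sketch of the Monarch parametrization — two block-diagonal factors interleaved with a fixed riffle permutation — assuming the square case b = √n, where that permutation is its own inverse (names are illustrative):

```python
import numpy as np

def riffle_permutation(n: int, b: int) -> np.ndarray:
    """Permutation matrix that transposes the (b, n//b) index grid."""
    idx = np.arange(n).reshape(b, n // b).T.ravel()
    P = np.zeros((n, n))
    P[np.arange(n), idx] = 1
    return P

def block_diagonal(blocks):
    """Assemble square blocks into one block-diagonal matrix."""
    n = sum(B.shape[0] for B in blocks)
    M, r = np.zeros((n, n)), 0
    for B in blocks:
        M[r:r + B.shape[0], r:r + B.shape[0]] = B
        r += B.shape[0]
    return M

def random_monarch(n: int, b: int, rng) -> np.ndarray:
    """M = P L P R with block-diagonal L and R, one common way to
    write a Monarch matrix; 2 * b * (n//b)**2 free parameters."""
    L = block_diagonal([rng.standard_normal((n // b, n // b)) for _ in range(b)])
    R = block_diagonal([rng.standard_normal((n // b, n // b)) for _ in range(b)])
    P = riffle_permutation(n, b)
    return P @ L @ P @ R
```

With b = √n the two factors hold about 2n^1.5 parameters versus n² for a dense matrix, while matrix-vector products remain cheap batched block multiplies.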
A Dynamic Fast Gaussian Transform
The main result is an efficient dynamic FGT algorithm supporting the following operations in log(n/ε) time: adding or deleting a source point, and estimating the kernel density of a query point with respect to the sources to ε additive accuracy.
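For intuition, here is a naive dynamic kernel-density structure with O(1) updates but O(n) queries; the paper's contribution is replacing the O(n) query with a log(n/ε)-time estimate. This class is illustrative only:

```python
import math

class DynamicKDE:
    """Maintains 1-D source points; density is the mean Gaussian kernel."""

    def __init__(self, bandwidth: float = 1.0):
        self.h = bandwidth
        self.sources: list[float] = []

    def add(self, x: float) -> None:
        self.sources.append(x)  # O(1) insertion

    def delete(self, x: float) -> None:
        self.sources.remove(x)  # O(n) here; O(log(n/eps)) in the paper

    def density(self, q: float) -> float:
        """Mean Gaussian kernel over all sources -- the O(n) step the
        fast Gaussian transform accelerates."""
        if not self.sources:
            return 0.0
        return sum(math.exp(-((q - x) / self.h) ** 2)
                   for x in self.sources) / len(self.sources)
```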


Kaleidoscope: An Efficient, Learnable Representation For All Structured Linear Maps
A family of matrices called kaleidoscope matrices (K-matrices) are introduced that provably capture any structured matrix with near-optimal space (parameter) and time (arithmetic operation) complexity that can be automatically learned within end-to-end pipelines to replace hand-crafted procedures.
ReSprop: Reuse Sparsified Backpropagation
  • Negar Goli, Tor M. Aamodt
  • Computer Science
    2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
  • 2020
This work proposes a new algorithm, Reuse-Sparse-Backprop (ReSprop), as a method to sparsify gradient vectors during CNN training, and introduces a generic sparse convolution neural network accelerator (GSCN), which is designed to accelerate sparse back-propagation convolutions.
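The reuse idea can be sketched in a few lines (a toy version, not the paper's exact scheme or its GSCN accelerator): recompute only the gradient components that changed most since the previous iteration, and reuse the stale values elsewhere:

```python
import numpy as np

def reuse_sparse_grad(g_new: np.ndarray, g_prev: np.ndarray,
                      keep: float) -> np.ndarray:
    """Keep the `keep` fraction of entries where the fresh gradient
    differs most from g_prev; reuse g_prev for the rest."""
    k = max(1, int(keep * g_new.size))
    change = np.abs(g_new - g_prev).ravel()
    top = np.argpartition(change, -k)[-k:]  # indices of largest changes
    out = g_prev.ravel().copy()
    out[top] = g_new.ravel()[top]
    return out.reshape(g_new.shape)
```

In hardware terms, only the `keep` fraction of backward convolutions needs to be recomputed, which is the sparsity the proposed accelerator exploits.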
SNIP: Single-shot Network Pruning based on Connection Sensitivity
This work presents a new approach that prunes a given network once at initialization prior to training, and introduces a saliency criterion based on connection sensitivity that identifies structurally important connections in the network for the given task.
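The saliency criterion is simple to state: score each connection by |w · ∂L/∂w| at initialization and keep the top fraction. A numpy sketch of the masking step (the real method obtains the gradients from one minibatch through the untrained network):

```python
import numpy as np

def snip_mask(weights: np.ndarray, grads: np.ndarray,
              keep: float) -> np.ndarray:
    """Binary mask keeping the `keep` fraction of connections with the
    largest connection sensitivity |w * dL/dw|."""
    saliency = np.abs(weights * grads)
    k = max(1, int(keep * saliency.size))
    thresh = np.partition(saliency.ravel(), -k)[-k]  # k-th largest score
    return (saliency >= thresh).astype(float)
```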
Pruning Filters for Efficient ConvNets
This work presents an acceleration method for CNNs, where it is shown that even simple filter pruning techniques can reduce inference costs for VGG-16 and ResNet-110 by up to 38% on CIFAR10 while regaining close to the original accuracy by retraining the networks.
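The simple technique in question ranks each output filter by its L1 norm and removes the smallest ones. A minimal sketch, assuming PyTorch-style (out_ch, in_ch, kH, kW) weight layout:

```python
import numpy as np

def prune_filters_l1(conv_w: np.ndarray, n_prune: int) -> np.ndarray:
    """Drop the n_prune output filters with the smallest L1 norm from a
    conv weight of shape (out_ch, in_ch, kH, kW)."""
    norms = np.abs(conv_w).sum(axis=(1, 2, 3))       # one norm per filter
    keep = np.sort(np.argsort(norms)[n_prune:])      # surviving indices
    return conv_w[keep]
```

After pruning, the next layer's input channels must be trimmed to match, and the network retrained to recover accuracy, per the paper's recipe.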
Learning both Weights and Connections for Efficient Neural Network
A method to reduce the storage and computation required by neural networks by an order of magnitude, without affecting accuracy, by learning only the important connections and pruning redundant ones with a three-step method.
A Convergence Theory for Deep Learning via Over-Parameterization
This work proves why stochastic gradient descent can find global minima on the training objective of DNNs in polynomial time, and implies an equivalence between over-parameterized neural networks and the neural tangent kernel (NTK) in the finite (and polynomial) width setting.
Fine-Grained Analysis of Optimization and Generalization for Overparameterized Two-Layer Neural Networks
This paper analyzes training and generalization for a simple 2-layer ReLU net with random initialization, and provides the following improvements over recent works: a tighter characterization of training speed, an explanation for why training a neural network with random labels leads to slower training, and a data-dependent complexity measure.
Generating Long Sequences with Sparse Transformers
This paper introduces sparse factorizations of the attention matrix which reduce its quadratic cost to $O(n \sqrt{n})$, generates unconditional samples that demonstrate global coherence and great diversity, and shows it is possible in principle to use self-attention to model sequences of length one million or more.
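One of the factorized patterns can be reproduced in a few lines: each position attends to a local window plus every stride-th earlier position, giving O(n√n) nonzeros when stride ≈ √n. This sketches the pattern only, not the paper's fused attention kernels:

```python
import numpy as np

def strided_attention_mask(n: int, stride: int) -> np.ndarray:
    """Causal mask: position i attends to the previous `stride`
    positions and to every stride-th earlier position."""
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        for j in range(i + 1):  # causal: only j <= i
            if i - j < stride or (i - j) % stride == 0:
                mask[i, j] = True
    return mask
```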
Butterfly Transform: An Efficient FFT Based Neural Architecture Design
It is shown that extending the butterfly operations from the FFT algorithm to a general Butterfly Transform (BFT) can be beneficial in building an efficient block structure for CNN designs, and that ShuffleNet-V2+BFT outperforms the state-of-the-art architecture search methods MnasNet, FBNet, and MobileNetV3 in the low-FLOP regime.
Scatterbrain: Unifying Sparse and Low-rank Attention Approximation
This work proposes Scatterbrain, a novel way to unify sparse (via locality-sensitive hashing) and low-rank (via kernel feature map) attention for accurate and efficient approximation, and shows that it can achieve 2.1× lower error than baselines when serving as a drop-in replacement in BigGAN image generation and pre-trained T2T-ViT.
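The decomposition behind this can be demonstrated exactly with a truncated SVD plus the largest residual entries; the paper approximates both parts cheaply with kernel feature maps and LSH, so this exact version only shows why the combination beats either part alone:

```python
import numpy as np

def sparse_plus_lowrank(A: np.ndarray, rank: int, nnz: int):
    """Split A into a rank-`rank` part (truncated SVD) plus a sparse
    part holding the `nnz` largest-magnitude residual entries."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    low = (U[:, :rank] * s[:rank]) @ Vt[:rank]        # best rank-r fit
    resid = (A - low).ravel()
    top = np.argpartition(np.abs(resid), -nnz)[-nnz:]  # biggest residuals
    sparse = np.zeros_like(resid)
    sparse[top] = resid[top]
    return low, sparse.reshape(A.shape)
```

Because the sparse part absorbs the residual's largest entries, the combined approximation error is strictly below the low-rank-only error whenever the residual is nonzero.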