Training Your Sparse Neural Network Better with Any Mask

Ajay Jaiswal, Haoyu Ma, Tianlong Chen, Ying Ding, Zhangyang Wang. International Conference on Machine Learning.

Pruning large neural networks to create high-quality, independently trainable sparse masks that maintain performance comparable to their dense counterparts is very desirable due to the reduced space and time complexity. As research effort focuses on increasingly sophisticated pruning methods that lead to sparse subnetworks trainable from scratch, we argue for an orthogonal, under-explored theme: improving training techniques for pruned sub-networks, i.e. sparse training. Apart…

RoS-KD: A Robust Stochastic Knowledge Distillation Approach for Noisy Medical Imaging

A Robust Stochastic Knowledge Distillation (RoS-KD) framework is proposed, which mimics the notion of learning a topic from multiple sources in order to deter the learning of noisy information.

Symbolic Distillation for Learned TCP Congestion Control

A novel symbolic branching algorithm makes the distilled rules aware of context across varied network conditions, eventually converting the NN policy into a symbolic tree that preserves or improves performance over state-of-the-art NN policies while being faster and simpler than a standard neural network.

Training Neural Networks with Fixed Sparse Masks

This paper shows that it is possible to induce a fixed sparse mask on the model's parameters, selecting a subset to update over many iterations; this approach matches or exceeds the performance of other methods for training with sparse updates while being more efficient in memory usage and communication costs.
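The core mechanic described here, updating only a fixed subset of parameters selected by a binary mask, can be illustrated with a minimal numpy sketch. This is an assumption-laden illustration, not the paper's implementation; the function name `masked_sgd_step` and the toy values are invented for the example.

```python
import numpy as np

def masked_sgd_step(params, grads, mask, lr):
    """Hypothetical sketch of sparse-update training: a fixed binary mask
    selects which parameters receive gradient updates. Masked-out
    parameters stay frozen, shrinking the update payload that must be
    stored or communicated."""
    return params - lr * (grads * mask)

# Toy example: only the first parameter is in the update set.
p = np.array([1.0, 2.0])
g = np.array([0.5, 0.5])
m = np.array([1.0, 0.0])
p_new = masked_sgd_step(p, g, m, lr=0.1)
```

Because the mask is fixed across iterations, only the selected entries ever change, which is what makes the per-step update cheap to communicate.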

The Difficulty of Training Sparse Neural Networks

It is shown that, despite the failure of optimizers, there is a linear path with a monotonically decreasing objective from the initialization to the "good" solution, and traversing extra dimensions may be needed to escape stationary points found in the sparse subspace.

SNIP: Single-shot Network Pruning based on Connection Sensitivity

This work presents a new approach that prunes a given network once at initialization prior to training, and introduces a saliency criterion based on connection sensitivity that identifies structurally important connections in the network for the given task.
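SNIP's connection-sensitivity saliency is concrete enough to sketch: each connection is scored by the magnitude of weight times gradient (the first-order effect of removing it), and only the top-scoring fraction is kept. A minimal numpy sketch, assuming a single weight tensor and one gradient evaluation; `snip_mask` is a name invented for this example.

```python
import numpy as np

def snip_mask(weights, grads, sparsity):
    """Hypothetical sketch of SNIP's saliency criterion: score each
    connection by |weight * gradient| and keep the top (1 - sparsity)
    fraction (ties at the threshold may keep slightly more)."""
    saliency = np.abs(weights * grads)
    k = int(round((1.0 - sparsity) * saliency.size))  # connections to keep
    threshold = np.sort(saliency, axis=None)[-k]      # k-th largest score
    return (saliency >= threshold).astype(np.float32)

# Toy example: a 2x3 weight matrix pruned once to 50% sparsity.
w = np.array([[0.5, -0.1, 0.3], [0.05, -0.8, 0.2]])
g = np.array([[1.0,  2.0, 0.1], [0.5,   1.0, 3.0]])
m = snip_mask(w, g, sparsity=0.5)
```

The gradient here would come from a single minibatch at initialization, which is what makes the method "single-shot".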

A Signal Propagation Perspective for Pruning Neural Networks at Initialization

By viewing connection sensitivity as a form of gradient, this work formally characterizes initialization conditions that ensure reliable connection sensitivity measurements, which in turn yield effective pruning results; the resulting modifications to the existing pruning-at-initialization method lead to improved results on all tested network models for image classification tasks.

Rigging the Lottery: Making All Tickets Winners

This paper introduces a method to train sparse neural networks with a fixed parameter count and a fixed computational cost throughout training, without sacrificing accuracy relative to existing dense-to-sparse training methods.
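The fixed-parameter-count training described here (RigL) periodically redistributes connectivity: it drops the smallest-magnitude active weights and grows the same number of inactive connections with the largest gradient magnitude. A simplified numpy sketch of one such update, under the assumption of a single weight tensor; `rigl_update` and the toy values are invented for illustration, and the real method adds schedules and per-layer budgets.

```python
import numpy as np

def rigl_update(weights, grads, mask, drop_frac):
    """Hypothetical sketch of one RigL connectivity update: drop a
    fraction of active weights by smallest magnitude, then grow the same
    number of inactive connections by largest gradient magnitude, so the
    total parameter count stays fixed."""
    mask = mask.astype(bool)
    n_drop = int(drop_frac * mask.sum())
    if n_drop == 0:
        return mask.astype(np.float32)
    # Drop: among active weights, the smallest |w|.
    active_mag = np.where(mask, np.abs(weights), np.inf)
    drop_idx = np.argsort(active_mag, axis=None)[:n_drop]
    new_mask = mask.ravel().copy()
    new_mask[drop_idx] = False
    # Grow: among inactive connections, the largest |g|.
    inactive_grad = np.where(new_mask.reshape(mask.shape), -np.inf, np.abs(grads))
    grow_idx = np.argsort(inactive_grad, axis=None)[-n_drop:]
    new_mask[grow_idx] = True
    return new_mask.reshape(mask.shape).astype(np.float32)

# Toy example: 2 of 4 weights active; drop 1, grow 1.
w = np.array([[1.0, 0.0], [0.0, 2.0]])
g = np.array([[0.1, 5.0], [0.2, 0.3]])
m0 = np.array([[1.0, 0.0], [0.0, 1.0]])
m1 = rigl_update(w, g, m0, drop_frac=0.5)
```

Note the invariant: the number of active connections before and after the update is identical, which is what keeps memory and compute constant throughout training.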

The State of Sparsity in Deep Neural Networks

It is shown that unstructured sparse architectures learned through pruning cannot be trained from scratch to the same test set performance as a model trained with joint sparsification and optimization, and the need for large-scale benchmarks in the field of model compression is highlighted.

Rethinking the Value of Network Pruning

It is found that, with an optimal learning rate, the "winning ticket" initialization used in Frankle & Carbin (2019) brings no improvement over random initialization, suggesting the need for more careful baseline evaluations in future research on structured pruning methods.

Chasing Sparsity in Vision Transformers: An End-to-End Exploration

This paper launches and reports a first-of-its-kind comprehensive exploration of integrating sparsity in ViTs "from end to end" by dynamically extracting and training sparse subnetworks while sticking to a fixed small parameter budget.

Do We Actually Need Dense Over-Parameterization? In-Time Over-Parameterization in Sparse Training

By proposing the concept of In-Time Over-Parameterization (ITOP) in sparse training, this work introduces a new perspective on training deep neural networks to state-of-the-art performance without the need for expensive dense over-parameterization.

Pruning neural networks without any data by iteratively conserving synaptic flow

The data-agnostic pruning algorithm challenges the existing paradigm that, at initialization, data must be used to quantify which synapses are important, and consistently competes with or outperforms existing state-of-the-art pruning algorithms at initialization over a range of models, datasets, and sparsity constraints.
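SynFlow's data-free score can be sketched in closed form for a chain of linear layers: feed an all-ones input through the network with absolute-valued weights and score each weight by |w| times the gradient of the scalar output R = 1ᵀ(∏ₗ|Wₗ|)1. A minimal numpy sketch under those assumptions; `synflow_scores` is an invented name, and the real algorithm prunes iteratively and handles general architectures via autodiff.

```python
import numpy as np

def synflow_scores(layers):
    """Hypothetical sketch of SynFlow's data-agnostic saliency for a
    chain of linear layers: score[l] = |W_l| * dR/d|W_l| where
    R = 1^T (prod_l |W_l|) 1, computed via forward/backward products."""
    abs_w = [np.abs(W) for W in layers]
    # fwd[l]: all-ones input propagated up to layer l's input.
    fwd = [np.ones(abs_w[0].shape[1])]
    for W in abs_w:
        fwd.append(W @ fwd[-1])
    # bwd[l]: all-ones output back-propagated to layer l's output.
    bwd = [np.ones(abs_w[-1].shape[0])]
    for W in reversed(abs_w[1:]):
        bwd.append(W.T @ bwd[-1])
    bwd.reverse()
    # dR/d|W_l| is the outer product bwd[l] fwd[l]^T.
    return [abs_w[l] * np.outer(bwd[l], fwd[l]) for l in range(len(abs_w))]

# Toy two-layer chain: R = 1^T |W2| |W1| 1 = 3 * (1 + 2) = 9.
W1 = np.array([[1.0, 2.0]])
W2 = np.array([[3.0]])
scores = synflow_scores([W1, W2])
```

The per-layer score totals are equal (each sums to R), illustrating the conservation of synaptic flow that the iterative algorithm exploits to avoid layer collapse.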