Unmasking the Lottery Ticket Hypothesis: What's Encoded in a Winning Ticket's Mask?

Mansheej Paul, F. Chen, Brett W. Larsen, Jonathan Frankle, Surya Ganguli and Gintare Karolina Dziugaite
Modern deep learning involves training costly, highly overparameterized networks, motivating the search for sparser networks that can still be trained to the same accuracy as the full network (i.e., matching). Iterative magnitude pruning (IMP) is a state-of-the-art algorithm that can find such highly sparse matching subnetworks, known as winning tickets. IMP operates through iterative cycles of training, masking the smallest-magnitude weights, rewinding to an early point in training, and repeating…
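The IMP-with-rewinding loop described above can be sketched as follows; the function names, the `train_fn` stand-in, and the pruning schedule are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def imp_with_rewinding(w_rewind, train_fn, prune_frac=0.2, rounds=3):
    """Minimal sketch of IMP with rewinding: each round trains the
    surviving weights, masks the smallest-magnitude fraction of them,
    and rewinds the survivors to their early-training values w_rewind."""
    mask = np.ones_like(w_rewind, dtype=bool)
    for _ in range(rounds):
        # Train from the rewind point with the current mask applied;
        # train_fn is a stand-in for a full training run.
        trained = train_fn(np.where(mask, w_rewind, 0.0))
        # Remove the smallest-magnitude surviving weights.
        threshold = np.quantile(np.abs(trained[mask]), prune_frac)
        mask &= np.abs(trained) > threshold
    # The winning ticket is the final mask applied to the rewind point.
    return mask, np.where(mask, w_rewind, 0.0)
```

With an identity `train_fn`, this reduces to repeated magnitude pruning of the rewind-point weights; in practice each round is a full (and expensive) training run, which is what motivates the cheaper alternatives surveyed below.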

Train Flat, Then Compress: Sharpness-Aware Minimization Learns More Compressible Models

This work combines sharpness-aware minimization (SAM) with several task-specific model compression methods, including iterative magnitude pruning (IMP), structured pruning with a distillation objective, and post-training dynamic quantization, showing that SAM yields simpler parameterizations and thus more compressible models.

SWAMP: Sparse Weight Averaging with Multiple Particles for Iterative Magnitude Pruning

This work proposes Sparse Weight Averaging with Multiple Particles (SWAMP), a straightforward modification of IMP that achieves performance comparable to an ensemble of two IMP solutions.

Break It Down: Evidence for Structural Compositionality in Neural Networks

The results demonstrate that models often implement solutions to subroutines via modular subnetworks, which can be ablated while maintaining the functionality of other subroutines, suggesting that neural networks can learn to exhibit compositionality without specialized symbolic mechanisms.

CAP: Correlation-Aware Pruning for Highly-Accurate Sparse Vision Models

The Correlation-Aware Pruner (CAP) is introduced: a new unstructured pruning framework that significantly pushes the compressibility limits of state-of-the-art architectures and is used to show, for the first time, that highly accurate large vision models trained via self-supervised techniques can also be pruned to moderate sparsities with negligible accuracy loss.

Winning the Lottery with Continuous Sparsification

Continuous Sparsification is proposed, a new algorithm to search for winning tickets which continuously removes parameters from a network during training, and learns the sub-network's structure with gradient-based methods instead of relying on pruning strategies.
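The continuous relaxation behind this style of gradient-based mask learning can be illustrated with a sigmoid gate whose temperature is annealed toward a hard 0/1 decision; the names and single-weight formulation below are a simplified sketch, not the paper's parameterization:

```python
import math

def gated_weight(w, s, beta):
    """Effective weight under a continuous mask: the gate
    sigmoid(beta * s) relaxes the binary keep/remove decision so the
    mask parameter s can be trained by gradient descent. Annealing
    beta upward during training drives the gate toward a hard mask:
    for s > 0 the gate tends to 1, for s < 0 it tends to 0."""
    gate = 1.0 / (1.0 + math.exp(-beta * s))
    return w * gate, gate
```

Because the gate is differentiable in `s` at every temperature, the subnetwork structure is learned jointly with the weights rather than chosen by a post-hoc pruning heuristic.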

Rare Gems: Finding Lottery Tickets at Initialization

Gem-Miner is proposed, which finds lottery tickets at initialization that beat current baselines; it is competitive with or better than Iterative Magnitude Pruning (IMP), and up to $19\times$ faster.

Lottery Tickets on a Data Diet: Finding Initializations with Sparse Trainable Networks

This work observes that, holding the number of pre-training iterations constant, training on a small fraction of (randomly chosen) data suffices to obtain an equally good initialization for IMP and identifies novel properties of the loss landscape of dense networks that are predictive of IMP performance.

The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks

This work finds that dense, randomly-initialized, feed-forward networks contain subnetworks ("winning tickets") that - when trained in isolation - reach test accuracy comparable to the original network in a similar number of iterations, and articulates the "lottery ticket hypothesis".

Deconstructing Lottery Tickets: Zeros, Signs, and the Supermask

This paper studies the three critical components of the lottery ticket algorithm, showing that each can be varied significantly without affecting the overall results; it explains why setting pruned weights to zero is important, shows that the signs of the original initialization are all you need to make the reinitialized network train, and argues that masking behaves like a form of training.

Picking Winning Tickets Before Training by Preserving Gradient Flow

This work argues that efficient training requires preserving the gradient flow through the network, and proposes a simple but effective pruning criterion called Gradient Signal Preservation (GraSP), which achieves significantly better performance than the baseline at extreme sparsity levels.
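As a rough illustration of the GraSP criterion (the function names, signature, and sign convention below are assumptions for this sketch, not the paper's code): each weight is scored by -θ ⊙ (Hg), the elementwise product of the weights with a Hessian-gradient product, and the highest-scoring fraction is removed so that the surviving weights are those whose removal would most hurt gradient flow:

```python
import numpy as np

def grasp_mask(theta, hess_grad_prod, prune_frac):
    """Sketch of GraSP-style one-shot pruning at initialization.
    theta: flattened weights; hess_grad_prod: H @ g, in practice
    computed with an autograd Hessian-vector product."""
    scores = -theta * hess_grad_prod
    k = int(round(prune_frac * theta.size))  # number of weights to remove
    if k == 0:
        return np.ones_like(theta, dtype=bool)
    cutoff = np.sort(scores)[-k]  # k-th largest score
    return scores < cutoff  # keep weights scoring below the cutoff
```

Unlike IMP, this requires no training cycles: one gradient and one Hessian-vector product at initialization suffice to produce the mask.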

Linear Mode Connectivity and the Lottery Ticket Hypothesis

This work finds that standard vision models become stable to SGD noise in this way early in training, and uses this technique to study iterative magnitude pruning (IMP), the procedure used by work on the lottery ticket hypothesis to identify subnetworks that could have trained in isolation to full accuracy.
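The stability test behind this analysis can be sketched as evaluating the loss along the straight line between two trained solutions; the names and the barrier definition below are a simplified stand-in for the paper's procedure:

```python
def loss_barrier(loss_fn, w_a, w_b, steps=11):
    """Sketch of a linear mode connectivity check: evaluate the loss
    along the straight line between solutions w_a and w_b. A barrier
    near zero (peak loss minus the average endpoint loss) indicates
    the two solutions are linearly mode connected."""
    endpoint_avg = 0.5 * (loss_fn(w_a) + loss_fn(w_b))
    peak = max(
        loss_fn([(1 - t) * a + t * b for a, b in zip(w_a, w_b)])
        for t in (i / (steps - 1) for i in range(steps))
    )
    return peak - endpoint_avg
```

Two copies of a network trained from the same early checkpoint with different SGD noise can be compared this way; a vanishing barrier is the stability signal used to predict when IMP succeeds.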

Gradient Flow in Sparse Neural Networks and How Lottery Tickets Win

It is shown that sparse neural networks have poor gradient flow at initialization, and a modified initialization for unstructured connectivity is proposed; dynamic sparse training (DST) methods are also found to significantly improve gradient flow during training over traditional sparse training methods.

To prune, or not to prune: exploring the efficacy of pruning for model compression

Across a broad range of neural network architectures, large-sparse models are found to consistently outperform small-dense models, achieving up to a 10x reduction in the number of non-zero parameters with minimal loss in accuracy.

Sanity-Checking Pruning Methods: Random Tickets can Win the Jackpot

Experimental results show that zero-shot random tickets match or outperform existing "initial tickets", and a new method, "hybrid tickets", which achieves further improvement, is proposed.