• Corpus ID: 182952565

Stabilizing the Lottery Ticket Hypothesis

@article{Frankle2019StabilizingTL,
  title   = {Stabilizing the Lottery Ticket Hypothesis},
  author  = {Jonathan Frankle and Gintare Karolina Dziugaite and Daniel M. Roy and Michael Carbin},
  journal = {ArXiv},
  year    = {2019}
}
• Published 5 March 2019
• Computer Science
• arXiv: Learning
Pruning is a well-established technique for removing unnecessary structure from neural networks after training to improve the performance of inference. Several recent results have explored the possibility of pruning at initialization time to provide similar benefits during training. In particular, the "lottery ticket hypothesis" conjectures that typical neural networks contain small subnetworks that can train to similar accuracy in a commensurate number of steps. The evidence for this claim is…
124 Citations


Evaluating the Emergence of Winning Tickets by Structured Pruning of Convolutional Networks
• Computer Science
2020 33rd SIBGRAPI Conference on Graphics, Patterns and Images (SIBGRAPI)
• 2020
Novel empirical evidence is presented that it is possible to obtain winning tickets when performing structured pruning of convolutional neural networks by comparing the resulting pruned networks with their versions trained with randomly initialized weights.
Juvenile state hypothesis: What we can learn from lottery ticket hypothesis researches?
A strategy is proposed that combines neural network structure search with a pruning algorithm, alleviating the difficulty of training and the performance degradation of sub-networks after pruning, as well as the weight forgetting of the original lottery ticket hypothesis.
Winning the Lottery with Continuous Sparsification
• Computer Science
NeurIPS
• 2020
Continuous Sparsification is proposed, a new algorithm to search for winning tickets which continuously removes parameters from a network during training, and learns the sub-network's structure with gradient-based methods instead of relying on pruning strategies.
A Principled Investigation of the Lottery Ticket Hypothesis for Deep Neural Networks
In a recent paper, Frankle and Carbin advanced the hypothesis that dense, randomly initialized neural networks contain small subnetworks which, when trained in isolation, reach training accuracy comparable to the original network in the same number of passes.
Spending Your Winning Lottery Better After Drawing It
• Computer Science
• 2021
It is demonstrated that sparse retraining need not strictly inherit those properties from the dense network: by plugging in purposeful “tweaks” to the sparse subnetwork architecture or its training recipe, retraining can be improved significantly over the default, especially at high sparsity levels.
“Winning Tickets” Without Training Data
• Computer Science
• 2020
It is shown that Paths with Higher Edge-Weights (PHEW) at initialization have higher loss gradient magnitude, resulting in more efficient training; the structural similarity between PHEW networks and pruned networks constructed through Iterative Magnitude Pruning is evaluated, concluding that the former belong to the family of winning-ticket networks.
Provably Efficient Lottery Ticket Discovery
• Computer Science
ArXiv
• 2021
Experiments demonstrate the validity of the theoretical results across a variety of architectures and datasets, including multi-layer perceptrons trained on MNIST and several deep convolutional neural network (CNN) architectures trained on CIFAR-10 and ImageNet.
Good Students Play Big Lottery Better
• Computer Science
ArXiv
• 2021
This paper presents a new, simpler, and yet powerful technique for re-training the sub-network, called the "Knowledge Distillation ticket" (KD ticket), which addresses a complementary possibility: recycling useful knowledge from the late training phase of the dense model.
Why Lottery Ticket Wins? A Theoretical Perspective of Sample Complexity on Pruned Neural Networks
• Computer Science
NeurIPS
• 2021
This work characterizes the performance of training a pruned neural network by analyzing the geometric structure of the objective function and the sample complexity needed to achieve zero generalization error, and shows that the convex region near a desirable model with guaranteed generalization enlarges as the network is pruned, indicating the structural importance of a winning ticket.
Efficient Lottery Ticket Finding: Less Data is More
• Computer Science
ICML
• 2021
Crucially, it is shown that a PrAC (Pruning-Aware Critical) set, once found, is reusable across different network architectures, which can amortize the extra cost of finding PrAC sets and yields a practical regime for efficient lottery ticket finding.

References

SHOWING 1-10 OF 36 REFERENCES
The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks
• Computer Science
ICLR
• 2019
This work finds that dense, randomly-initialized, feed-forward networks contain subnetworks ("winning tickets") that - when trained in isolation - reach test accuracy comparable to the original network in a similar number of iterations, and articulate the "lottery ticket hypothesis".
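The procedure behind finding such winning tickets is iterative magnitude pruning (IMP): train the network, prune the smallest-magnitude surviving weights, rewind the survivors to their initial values, and repeat. A minimal NumPy sketch of one round, assuming weights are flattened into a single vector (function and variable names are illustrative, not the authors' code):

```python
import numpy as np

def imp_round(weights_init, weights_trained, mask, prune_frac=0.2):
    """One round of iterative magnitude pruning (IMP): remove the
    smallest-magnitude fraction of surviving weights, then rewind the
    survivors to their values at initialization."""
    surviving = np.abs(weights_trained[mask])
    k = int(len(surviving) * prune_frac)
    if k > 0:
        threshold = np.sort(surviving)[k - 1]
        mask = mask & (np.abs(weights_trained) > threshold)
    ticket = np.where(mask, weights_init, 0.0)  # the "winning ticket"
    return ticket, mask
```

In practice each round retrains the ticket from `weights_init` before the next pruning step; repeating with `prune_frac=0.2` yields the roughly 80%-per-round geometric sparsity schedule used in the paper.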
The Lottery Ticket Hypothesis: Training Pruned Neural Networks
• Computer Science
ArXiv
• 2018
The lottery ticket hypothesis and its connection to pruning are a step toward developing architectures, initializations, and training strategies that make it possible to solve the same problems with much smaller networks.
Rethinking the Value of Network Pruning
• Computer Science
ICLR
• 2019
It is found that with optimal learning rate, the "winning ticket" initialization as used in Frankle & Carbin (2019) does not bring improvement over random initialization, and the need for more careful baseline evaluations in future research on structured pruning methods is suggested.
Linear Mode Connectivity and the Lottery Ticket Hypothesis
• Computer Science
ICML
• 2020
This work finds that standard vision models become stable to SGD noise in this way early in training, and uses this technique to study iterative magnitude pruning (IMP), the procedure used by work on the lottery ticket hypothesis to identify subnetworks that could have trained in isolation to full accuracy.
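The instability analysis behind this result interpolates linearly between two trained copies of a network and measures the error barrier along the path; a network is "stable" to SGD noise when the barrier is near zero. A hedged NumPy sketch, where `loss_fn` and the barrier definition are simplified stand-ins for the paper's test-error measurement:

```python
import numpy as np

def interpolation_losses(w_a, w_b, loss_fn, num_points=11):
    """Evaluate the loss along the linear path between two solutions."""
    alphas = np.linspace(0.0, 1.0, num_points)
    losses = np.array([loss_fn((1 - a) * w_a + a * w_b) for a in alphas])
    return alphas, losses

def error_barrier(losses):
    """Height of the barrier: peak loss on the path minus the mean of
    the two endpoint losses (near zero => linearly mode connected)."""
    return losses.max() - 0.5 * (losses[0] + losses[-1])
```

With two solutions in separate loss basins the barrier is large; for linearly connected solutions it stays near zero, which is the regime where IMP subnetworks match full accuracy.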
To prune, or not to prune: exploring the efficacy of pruning for model compression
• Computer Science
ICLR
• 2018
Across a broad range of neural network architectures, large-sparse models are found to consistently outperform small-dense models and achieve up to 10x reduction in number of non-zero parameters with minimal loss in accuracy.
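The large-sparse models in this work are produced by gradual magnitude pruning, where target sparsity ramps up along a cubic schedule s_t = s_f + (s_i - s_f)(1 - (t - t_0)/(n·Δt))³ over the pruning interval. A small sketch of that schedule (parameter names are illustrative):

```python
def sparsity_schedule(step, s_init=0.0, s_final=0.9, t0=0, n=100, dt=1):
    """Cubic sparsity ramp of Zhu & Gupta (2017):
    s_t = s_f + (s_i - s_f) * (1 - (t - t0) / (n * dt)) ** 3,
    with the step clipped to the pruning interval [t0, t0 + n*dt]."""
    t = min(max(step, t0), t0 + n * dt)
    return s_final + (s_init - s_final) * (1.0 - (t - t0) / (n * dt)) ** 3
```

The cubic shape prunes aggressively early, when many weights are redundant, and slows down as the network approaches its final sparsity.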
SNIP: Single-shot Network Pruning based on Connection Sensitivity
• Computer Science
ICLR
• 2019
This work presents a new approach that prunes a given network once at initialization prior to training, and introduces a saliency criterion based on connection sensitivity that identifies structurally important connections in the network for the given task.
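SNIP's connection sensitivity scores each weight by the loss's sensitivity to removing its connection, evaluated once at initialization on a single minibatch, which reduces to |g_j · w_j| per weight. A minimal NumPy sketch assuming the gradients are supplied externally (no autograd; names are illustrative):

```python
import numpy as np

def snip_mask(weights, grads, keep_frac=0.5):
    """Single-shot pruning at initialization, SNIP-style: score each
    connection by |g * w| (sensitivity of the loss to removing it),
    normalize the scores, and keep the top `keep_frac` fraction."""
    saliency = np.abs(weights * grads)
    saliency = saliency / saliency.sum()
    k = max(1, int(len(weights) * keep_frac))
    threshold = np.sort(saliency)[-k]
    return saliency >= threshold
```

Because the mask is computed once before training, no iterative prune-retrain cycle is needed, in contrast to IMP.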
Gradient Descent Provably Optimizes Over-parameterized Neural Networks
• Computer Science
ICLR
• 2019
Over-parameterization and random initialization jointly restrict every weight vector to stay close to its initialization throughout training, which yields a strong convexity-like property showing that gradient descent converges at a global linear rate to the global optimum.
Dynamic parameter reallocation improves trainability of deep convolutional networks
• Computer Science
• 2018
It is shown that neither the structure nor the initialization of the discovered high-performance subnetwork is sufficient to explain its good performance; it is the dynamics of parameter reallocation that are responsible for successful learning.
Network Trimming: A Data-Driven Neuron Pruning Approach towards Efficient Deep Architectures
• Computer Science
ArXiv
• 2016
This paper introduces network trimming which iteratively optimizes the network by pruning unimportant neurons based on analysis of their outputs on a large dataset, inspired by an observation that the outputs of a significant portion of neurons in a large network are mostly zero.
A Convergence Theory for Deep Learning via Over-Parameterization
• Computer Science
ICML
• 2019
This work proves why stochastic gradient descent can find global minima on the training objective of DNNs in polynomial time, and implies an equivalence between over-parameterized neural networks and the neural tangent kernel (NTK) in the finite (and polynomial) width setting.