Corpus ID: 53388625

The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks

@article{Frankle2019TheLT,
  title={The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks},
  author={Jonathan Frankle and Michael Carbin},
  journal={arXiv: Learning},
  year={2019}
}
Neural network pruning techniques can reduce the parameter counts of trained networks by over 90%, decreasing storage requirements and improving computational performance of inference without compromising accuracy. […] We present an algorithm to identify winning tickets and a series of experiments that support the lottery ticket hypothesis and the importance of these fortuitous initializations. We consistently find winning tickets that are less than 10-20% of the size of several fully-connected and…
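The procedure the abstract refers to is iterative magnitude pruning with a rewind to the original initialization. Below is a minimal sketch, assuming a PyTorch model, an external `train_fn` that trains while keeping masked weights at zero, and an illustrative 20% per-round pruning rate; none of these specifics are taken from the paper's code.

```python
import copy
import torch


def find_winning_ticket(model, train_fn, rounds=5, prune_rate=0.2):
    # theta_0: the original initialization that surviving weights are rewound to
    init_state = copy.deepcopy(model.state_dict())
    # prune only weight matrices/filters (dim > 1), leaving biases dense
    masks = {name: torch.ones_like(p)
             for name, p in model.named_parameters() if p.dim() > 1}

    for _ in range(rounds):
        train_fn(model, masks)  # train to completion, keeping masked weights at zero

        for name, param in model.named_parameters():
            if name not in masks:
                continue
            surviving = param.detach()[masks[name].bool()].abs()
            k = int(prune_rate * surviving.numel())
            if k == 0:
                continue
            # drop the smallest-magnitude fraction of the still-surviving weights
            threshold = surviving.kthvalue(k).values
            masks[name] = masks[name] * (param.detach().abs() > threshold).float()

        # rewind: restore theta_0, then zero out the newly pruned weights
        model.load_state_dict(init_state)
        with torch.no_grad():
            for name, param in model.named_parameters():
                if name in masks:
                    param.mul_(masks[name])

    return masks  # together with init_state, this defines the candidate winning ticket
```

The returned masks plus the rewound initial weights together constitute the candidate winning ticket.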
The Lottery Ticket Hypothesis at Scale
TLDR
It is found that later resetting produces more stable winning tickets and that improved stability correlates with higher winning ticket accuracy.
Rare Gems: Finding Lottery Tickets at Initialization
TLDR
Gem-Miner is proposed, which finds lottery tickets at initialization that beat current baselines, training to better accuracy than simple baselines, and does so up to 19× faster.
Dual Lottery Ticket Hypothesis
TLDR
A simple sparse network training strategy, Random Sparse Network Transformation (RST), is proposed, which introduces a regularization term to borrow learning capacity and realize information extrusion from the weights which will be masked.
Winning the Lottery with Continuous Sparsification
TLDR
Continuous Sparsification is proposed, a new algorithm to search for winning tickets which continuously removes parameters from a network during training, and learns the sub-network's structure with gradient-based methods instead of relying on pruning strategies.
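Since Continuous Sparsification learns the sub-network's structure with gradients rather than discrete pruning, a hedged sketch of the idea is a weight gated by a soft mask that is sharpened over training. The layer class, the temperature `beta`, and the penalty interface below are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn


class SoftMaskedLinear(nn.Module):
    def __init__(self, in_features, out_features, s_init=0.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.01)
        self.bias = nn.Parameter(torch.zeros(out_features))
        # mask logits: sigmoid(beta * s) -> 1 keeps the weight, -> 0 prunes it
        self.s = nn.Parameter(torch.full((out_features, in_features), s_init))

    def forward(self, x, beta):
        mask = torch.sigmoid(beta * self.s)          # soft, differentiable mask
        return nn.functional.linear(x, self.weight * mask, self.bias)

    def sparsity_penalty(self, beta):
        # L1-style penalty on the soft mask pushes entries toward zero
        return torch.sigmoid(beta * self.s).sum()

    def final_mask(self):
        # after annealing beta upward, the mask is effectively binary: keep s > 0
        return (self.s > 0).float()
```

In this sketch, `beta` would be annealed upward during training so the soft mask approaches a binary one, and the penalty term is added to the task loss.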
Provably Efficient Lottery Ticket Discovery
TLDR
Experiments demonstrate the validity of the theoretical results across a variety of architectures and datasets, including multi-layer perceptrons trained on MNIST and several deep convolutional neural network (CNN) architectures trained on CIFAR10 and ImageNet.
Juvenile state hypothesis: What we can learn from lottery ticket hypothesis researches?
TLDR
A strategy is proposed that combines neural network architecture search with a pruning algorithm to alleviate the difficulty of training and the performance degradation of sub-networks after pruning, as well as the weight forgetting of the original lottery ticket hypothesis.
“Winning Tickets” Without Training Data
TLDR
It is shown that Paths with Higher Edge-Weights (PHEW) at initialization have higher loss gradient magnitude, resulting in more efficient training; the structural similarity between PHEW networks and pruned networks constructed through Iterative Magnitude Pruning is also evaluated, concluding that the former belong to the family of winning ticket networks.
Evaluating the Emergence of Winning Tickets by Structured Pruning of Convolutional Networks
TLDR
Novel empirical evidence is presented that it is possible to obtain winning tickets when performing structured pruning of convolutional neural networks by comparing the resulting pruned networks with their versions trained with randomly initialized weights.
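A hedged sketch of the kind of structured pruning evaluated there: zeroing whole convolutional filters ranked by L1 norm. The 50% ratio is an illustrative assumption, and the specific criterion may differ from the paper's.

```python
import torch
import torch.nn as nn


def prune_filters_l1(conv: nn.Conv2d, ratio: float = 0.5) -> torch.Tensor:
    # one L1 norm per output filter, shape (out_channels,)
    norms = conv.weight.detach().abs().sum(dim=(1, 2, 3))
    n_prune = int(ratio * norms.numel())
    prune_idx = torch.argsort(norms)[:n_prune]   # smallest-norm filters
    mask = torch.ones_like(norms)
    mask[prune_idx] = 0.0
    with torch.no_grad():
        conv.weight.mul_(mask.view(-1, 1, 1, 1))  # zero out whole filters
        if conv.bias is not None:
            conv.bias.mul_(mask)
    return mask  # per-filter keep/drop mask
```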
Good Students Play Big Lottery Better
TLDR
This paper presents a new, simpler and yet powerful technique for re-training the sub-network, called "Knowledge Distillation ticket" (KD ticket), which addresses a complementary possibility recycling useful knowledge from the late training phase of the dense model.
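A hedged sketch of retraining a pruned sub-network with soft targets from the dense model, in the spirit of the KD ticket; the temperature and mixing weight are illustrative assumptions.

```python
import torch.nn.functional as F


def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    # soft-target term: match the dense teacher's softened output distribution
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    # hard-target term: ordinary cross-entropy on the ground-truth labels
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard
```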
The Elastic Lottery Ticket Hypothesis
TLDR
The Elastic Lottery Ticket Hypothesis (E-LTH) is articulated: by mindfully replicating and re-ordering layers of one network, its winning ticket can be stretched into a subnetwork for another, deeper network from the same model family, whose performance is nearly as competitive as the latter's winning ticket found directly by IMP.
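A hedged sketch of the stretching idea: repeat a shallower ticket's per-block masks to cover a deeper network of the same family. This assumes matching block shapes, and the cyclic ordering below is illustrative rather than the paper's exact re-ordering rule.

```python
def stretch_ticket(block_masks, target_depth):
    """Stretch a list of per-block masks from a shallow ticket to target_depth blocks.

    block_masks: list of mask tensors, one per block of the shallower network,
    assumed shape-compatible with the deeper network's blocks.
    """
    stretched = []
    i = 0
    while len(stretched) < target_depth:
        stretched.append(block_masks[i % len(block_masks)])  # replicate cyclically
        i += 1
    return stretched
```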
…

References

Showing 1-10 of 71 references
Rethinking the Value of Network Pruning
TLDR
It is found that with an optimal learning rate, the "winning ticket" initialization as used in Frankle & Carbin (2019) does not bring improvement over random initialization, suggesting the need for more careful baseline evaluations in future research on structured pruning methods.
RandomOut: Using a convolutional gradient norm to win The Filter Lottery
TLDR
The gradient norm is used to evaluate the impact of a filter on error, and filters are re-initialized when the gradient norm of their weights falls below a specific threshold, which consistently improves accuracy across two datasets by up to 1.8%.
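A minimal sketch of that re-initialization step for a single convolutional layer, assuming a backward pass has already populated gradients; the threshold value and the Kaiming re-draw are illustrative assumptions.

```python
import torch
import torch.nn as nn


def reinit_low_gradient_filters(conv: nn.Conv2d, threshold: float = 1e-4) -> int:
    if conv.weight.grad is None:
        return 0
    # one gradient norm per output filter
    grad_norms = conv.weight.grad.detach().flatten(1).norm(dim=1)
    dead = grad_norms < threshold
    with torch.no_grad():
        if dead.any():
            fresh = torch.empty_like(conv.weight)
            nn.init.kaiming_normal_(fresh)
            conv.weight[dead] = fresh[dead]  # re-draw the "losing" filters
    return int(dead.sum())
```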
Gradient Descent Provably Optimizes Over-parameterized Neural Networks
TLDR
Over-parameterization and random initialization jointly restrict every weight vector to remain close to its initialization for all iterations, which yields a strong convexity-like property showing that gradient descent converges at a global linear rate to the global optimum.
Pruning Convolutional Neural Networks for Resource Efficient Inference
TLDR
It is shown that pruning can lead to more than 10x theoretical (5x practical) reduction in adapted 3D-convolutional filters with a small drop in accuracy in a recurrent gesture classifier.
Understanding the difficulty of training deep feedforward neural networks
TLDR
The objective here is to understand better why standard gradient descent from random initialization is doing so poorly with deep neural networks, to better understand these recent relative successes and help design better algorithms in the future.
Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
TLDR
Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin.
Exploring Sparsity in Recurrent Neural Networks
TLDR
This work proposes a technique to reduce the parameters of a network by pruning weights during the initial training of the network, which reduces the size of the model and can also help achieve significant inference time speed-up using sparse matrix multiply.
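A hedged sketch of pruning during initial training: a sparsity target that ramps up over training steps, with weights below the current magnitude threshold masked out. The linear schedule and the constants are illustrative assumptions; the paper uses its own schedule.

```python
import torch


def update_pruning_masks(model, masks, step, start=2000, end=20000, final_sparsity=0.9):
    """Tighten per-tensor masks as training progresses; call periodically from the training loop."""
    if step < start:
        return masks
    # fraction of weights to have pruned by this step (linear ramp for brevity)
    frac = final_sparsity * min(1.0, (step - start) / (end - start))
    for name, param in model.named_parameters():
        if name not in masks:
            continue
        flat = param.detach().abs().flatten()
        k = int(frac * flat.numel())
        if k > 0:
            threshold = flat.kthvalue(k).values
            masks[name] = masks[name] * (param.detach().abs() > threshold).float()
    return masks
```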
Compressibility and Generalization in Large-Scale Deep Learning
TLDR
This paper provides the first non-vacuous generalization guarantees for realistic architectures applied to the ImageNet classification problem and establishes an absolute limit on expected compressibility as a function of expected generalization error.
DSD: Regularizing Deep Neural Networks with Dense-Sparse-Dense Training Flow
TLDR
Experiments show that DSD training can improve the performance of a wide range of CNNs, RNNs and LSTMs on the tasks of image classification, caption generation and speech recognition.
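A hedged sketch of the dense-sparse-dense flow, assuming an external `train_fn` that applies an optional mask during training; the 50% sparsity level is illustrative.

```python
import torch


def dsd_train(model, train_fn, sparsity=0.5):
    train_fn(model, masks=None)                       # dense phase

    # sparse phase: prune by magnitude, then retrain under the fixed mask
    masks = {}
    for name, param in model.named_parameters():
        if param.dim() > 1:
            flat = param.detach().abs().flatten()
            k = int(sparsity * flat.numel())
            if k == 0:
                masks[name] = torch.ones_like(param)
            else:
                threshold = flat.kthvalue(k).values
                masks[name] = (param.detach().abs() > threshold).float()
    train_fn(model, masks=masks)

    train_fn(model, masks=None)                       # final dense re-training phase
    return model
```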
Data-free Parameter Pruning for Deep Neural Networks
TLDR
It is shown how similar neurons are redundant, and a systematic way to remove them is proposed, which can be applied on top of most networks with a fully connected layer to give a smaller network.
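A hedged sketch of one data-free merge step for a pair of fully connected layers: pick the two neurons with the most similar incoming weights, fold one's outgoing weights into the other, and zero out the removed neuron. The single-merge interface and the Euclidean similarity are illustrative assumptions.

```python
import torch
import torch.nn as nn


def merge_most_similar_neuron(fc1: nn.Linear, fc2: nn.Linear):
    W1 = fc1.weight.detach()              # (hidden, in): incoming weights per neuron
    dists = torch.cdist(W1, W1)           # pairwise distances between neurons
    dists.fill_diagonal_(float("inf"))
    i, j = divmod(int(torch.argmin(dists)), dists.size(1))
    with torch.no_grad():
        fc2.weight[:, i] += fc2.weight[:, j]   # survivor i absorbs j's outgoing weights
        fc2.weight[:, j] = 0.0
        fc1.weight[j] = 0.0                    # neuron j is effectively removed
        if fc1.bias is not None:
            fc1.bias[j] = 0.0
    return i, j
```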
…