• Corpus ID: 238634794

# Why Lottery Ticket Wins? A Theoretical Perspective of Sample Complexity on Pruned Neural Networks

@inproceedings{Zhang2021WhyLT,
title={Why Lottery Ticket Wins? A Theoretical Perspective of Sample Complexity on Pruned Neural Networks},
author={Shuai Zhang and Meng Wang and Sijia Liu and Pin-Yu Chen and Jinjun Xiong},
booktitle={Neural Information Processing Systems},
year={2021}
}
• Published in Neural Information Processing Systems, 12 October 2021
• Computer Science
The lottery ticket hypothesis (LTH) [20] states that learning on a properly pruned network (the winning ticket) improves test accuracy over the original unpruned network. Although LTH has been justified empirically in a broad range of applications involving deep neural networks (DNNs), such as computer vision and natural language processing, the theoretical validation of the improved generalization of a winning ticket remains elusive. To the best of our knowledge, our work, for the first time…
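The pruning step underlying most LTH experiments is simple magnitude pruning: zero out the smallest-magnitude weights and keep the rest (iterative magnitude pruning then rewinds the survivors and retrains). A minimal sketch of the masking step, with illustrative names not taken from the paper:

```python
def magnitude_prune(weights, sparsity):
    """Return a 0/1 mask keeping the largest-magnitude weights.

    A toy sketch of one-shot magnitude pruning, the core operation of
    iterative magnitude pruning (IMP). `weights` is a flat list of floats;
    `sparsity` is the fraction of weights to remove.
    """
    k = int(sparsity * len(weights))  # number of weights to zero out
    if k == 0:
        return [1] * len(weights)
    # k-th smallest magnitude serves as the pruning threshold
    threshold = sorted(abs(w) for w in weights)[k - 1]
    return [1 if abs(w) > threshold else 0 for w in weights]

# Example: prune 50% of a toy weight vector
mask = magnitude_prune([0.9, -0.1, 0.05, -0.7], 0.5)  # → [1, 0, 0, 1]
```

In IMP this masking is applied repeatedly, pruning a small fraction per round, with the surviving weights reset to their original initialization before each retraining pass.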

## Citations

### Analyzing Lottery Ticket Hypothesis from PAC-Bayesian Theory Perspective

• Computer Science
ArXiv
• 2022
It is shown that PAC-Bayesian theory can provide an explicit understanding of the relationship between LTH and generalization behavior; existing algorithms for finding winning tickets are revisited from a PAC-Bayesian perspective, yielding new insights into these methods.

### Can You Win Everything with A Lottery Ticket?

Overall, the results endorse choosing a good sparse subnetwork of a larger dense model over directly training a small dense model of similar parameter count, for researchers and engineers who seek to incorporate sparse neural networks into user-facing deployments.

### Most Activation Functions Can Win the Lottery Without Excessive Depth

It is shown that a network of depth L + 1 is sufficient, which indicates that winning lottery tickets can be expected to exist at realistic, commonly used depths while requiring only logarithmic overparametrization.

### Convolutional and Residual Networks Provably Contain Lottery Tickets

It is proved that modern architectures consisting of convolutional and residual layers, which can be equipped with almost arbitrary activation functions, also contain lottery tickets with high probability.

### Data-Efficient Double-Win Lottery Tickets from Robust Pre-training

• Computer Science
ICML
• 2022
This paper designs a more rigorous concept, Double-Win Lottery Tickets, in which a subnetwork located in a pre-trained model can be independently transferred to diverse downstream tasks and reach both the same standard and robust generalization, under both standard and adversarial training regimes, as the full pre-trained model.

### SHRIMP: Sparser Random Feature Models via Iterative Magnitude Pruning

• Computer Science
ArXiv
• 2021
This work proposes a new method to efficiently fit high-dimensional data with inherent low-order structure in the form of sparse variable dependencies and shows that SHRIMP is better or competitive against both random sparse feature models and shrunk additive models.

### Sparsity Winning Twice: Better Robust Generalization from More Efficient Training

• Computer Science
ICLR
• 2022
Two alternatives for sparse adversarial training are introduced: static sparsity and dynamic sparsity, both of which allow the sparse subnetwork to adaptively adjust its connectivity pattern (while sticking to the same sparsity ratio) throughout training.

### A Theoretical Understanding of Neural Network Compression from Sparse Linear Approximation

• Computer Science
ArXiv
• 2022
This work proposes to use the sparsity-sensitive ℓq-norm to characterize compressibility, provides a relationship between the soft sparsity of the weights in the network and the degree of compression with a controlled accuracy-degradation bound, and develops adaptive algorithms, informed by the theory, for pruning each neuron in the network.

### Long Live the Lottery: The Existence of Winning Tickets in Lifelong Learning

• Computer Science
• 2020
This paper demonstrates for the first time that such extremely compact and independently trainable sub-networks can also be identified in the lifelong learning scenario, and introduces lottery teaching, which further overcomes forgetting via knowledge distillation aided by external unlabeled data.

### How does unlabeled data improve generalization in self-training? A one-hidden-layer theoretical analysis

• Computer Science
ICLR
• 2022
It is proved that iterative self-training converges linearly, with both the convergence rate and the generalization accuracy improved on the order of 1/√M, where M is the number of unlabeled samples.

## References

Showing 1–10 of 75 references

### The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks

• Computer Science
ICLR
• 2019
This work finds that dense, randomly-initialized, feed-forward networks contain subnetworks ("winning tickets") that - when trained in isolation - reach test accuracy comparable to the original network in a similar number of iterations, and articulate the "lottery ticket hypothesis".

### Stabilizing the Lottery Ticket Hypothesis

• Computer Science
• 2019
This paper modifies IMP to search for subnetworks that could have been obtained by pruning early in training rather than at iteration 0, and studies subnetwork "stability," finding that, as accuracy improves in this fashion, IMP subnetworks train to parameters closer to those of the full network and do so with improved consistency in the face of gradient noise.

### Greedy Optimization Provably Wins the Lottery: Logarithmic Number of Winning Tickets is Enough

• Computer Science
NeurIPS
• 2020
A greedy-optimization-based pruning method with the guarantee that the discrepancy between the pruned network and the original network decays at an exponentially fast rate w.r.t. the size of the pruned network, under weak assumptions that apply to most practical settings.

### Picking Winning Tickets Before Training by Preserving Gradient Flow

• Computer Science
ICLR
• 2020
This work argues that efficient training requires preserving the gradient flow through the network, and proposes a simple but effective pruning criterion called Gradient Signal Preservation (GraSP), which achieves significantly better performance than the baseline at extreme sparsity levels.

### Proving the Lottery Ticket Hypothesis: Pruning is All You Need

• Computer Science
ICML
• 2020
An even stronger hypothesis is proved, showing that for every bounded distribution and every target network with bounded weights, a sufficiently over-parameterized neural network with random weights contains a subnetwork with roughly the same accuracy as the target network, without any further training.

### Drawing early-bird tickets: Towards more efficient training of deep networks

• Computer Science
ICLR
• 2020
This paper discovers for the first time that winning tickets can be identified at a very early training stage, which it terms early-bird (EB) tickets, via low-cost training schemes at large learning rates; this is consistent with recently reported observations that the key connectivity patterns of neural networks emerge early.

### Logarithmic Pruning is All You Need

• Computer Science, Mathematics
NeurIPS
• 2020
This work removes the most limiting assumptions of this previous work while providing significantly tighter bounds: the overparameterized network only needs a logarithmic factor number of neurons per weight of the target subnetwork.

### Rigging the Lottery: Making All Tickets Winners

• Computer Science
ICML
• 2020
This paper introduces a method to train sparse neural networks with a fixed parameter count and a fixed computational cost throughout training, without sacrificing accuracy relative to existing dense-to-sparse training methods.

### A Convergence Theory for Deep Learning via Over-Parameterization

• Computer Science
ICML
• 2019
This work proves why stochastic gradient descent can find global minima on the training objective of DNNs in polynomial time, and implies an equivalence between over-parameterized neural networks and the neural tangent kernel (NTK) in the finite (and polynomial) width setting.

### Deconstructing Lottery Tickets: Zeros, Signs, and the Supermask

• Computer Science
NeurIPS
• 2019
This paper studies the three critical components of the lottery ticket algorithm, showing that each may be varied significantly without impacting the overall results, and explains why setting weights to zero is important, how signs are all you need to make the reinitialized network train, and why masking behaves like training.