Corpus ID: 238634794

Why Lottery Ticket Wins? A Theoretical Perspective of Sample Complexity on Pruned Neural Networks

@inproceedings{Zhang2021WhyLT,
  title={Why Lottery Ticket Wins? A Theoretical Perspective of Sample Complexity on Pruned Neural Networks},
  author={Shuai Zhang and Meng Wang and Sijia Liu and Pin-Yu Chen and Jinjun Xiong},
  booktitle={Neural Information Processing Systems},
  year={2021}
}
The lottery ticket hypothesis (LTH) [20] states that learning on a properly pruned network (the winning ticket) improves test accuracy over the original unpruned network. Although LTH has been justified empirically in a broad range of deep neural network (DNN)-involved applications like computer vision and natural language processing, the theoretical validation of the improved generalization of a winning ticket remains elusive. To the best of our knowledge, our work, for the first time…
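
For orientation, the magnitude-pruning procedure commonly used to obtain a winning ticket can be sketched as follows. This is a minimal illustration, not the analysis in this paper; the layer shapes, sparsity level, and all names are ours.

import numpy as np

def winning_ticket_mask(trained_weights, sparsity=0.9):
    # Keep only the largest-magnitude (1 - sparsity) fraction of the trained weights.
    flat = np.abs(trained_weights).ravel()
    k = max(1, int(round((1.0 - sparsity) * flat.size)))
    threshold = np.partition(flat, flat.size - k)[flat.size - k]
    return (np.abs(trained_weights) >= threshold).astype(trained_weights.dtype)

# Illustrative usage: the "winning ticket" is the initialization restricted to the mask;
# it is then retrained in isolation and compared against the dense network.
rng = np.random.default_rng(0)
w_init = rng.normal(size=(256, 128))                       # stand-in for a layer at initialization
w_trained = w_init + 0.1 * rng.normal(size=w_init.shape)   # stand-in for the trained weights
mask = winning_ticket_mask(w_trained, sparsity=0.9)
winning_ticket = mask * w_init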

Citations

Analyzing Lottery Ticket Hypothesis from PAC-Bayesian Theory Perspective

It is shown that PAC-Bayesian theory can provide an explicit understanding of the relationship between LTH and generalization behavior; existing algorithms for finding winning tickets are then revisited from a PAC-Bayesian perspective, providing new insights into these methods.

Can You Win Everything with A Lottery Ticket?

Overall, the results endorse choosing a good sparse subnetwork of a larger dense model, over directly training a small dense model of similar parameter counts, for researchers and engineers who seek to incorporate sparse neural networks for user-facing deployments.

Most Activation Functions Can Win the Lottery Without Excessive Depth

It is shown that a depth L + 1 network is sufficient, which indicates that lottery tickets can be expected to exist at realistic, commonly used depths while only requiring logarithmic overparametrization.

Convolutional and Residual Networks Provably Contain Lottery Tickets

It is proved that modern architectures consisting of convolutional and residual layers, which can be equipped with almost arbitrary activation functions, also contain lottery tickets with high probability.

Data-Efficient Double-Win Lottery Tickets from Robust Pre-training

This paper designs a more rigorous concept, Double-Win Lottery Tickets, in which a located subnetwork from a pre-trained model can be independently transferred on diverse downstream tasks, to reach BOTH the same standard and robust generalization, under BOTH standard and adversarial training regimes, as the full pre-trained model can do.

SHRIMP: Sparser Random Feature Models via Iterative Magnitude Pruning

This work proposes a new method to efficiently fit high-dimensional data with inherent low-order structure in the form of sparse variable dependencies, and shows that SHRIMP is better than or competitive with both random sparse feature models and shrunk additive models.

Sparsity Winning Twice: Better Robust Generalization from More Efficient Training

Two alternatives for sparse adversarial training are introduced: static sparsity and dynamic sparsity, the latter of which allows the sparse subnetwork to adaptively adjust its connectivity pattern (while keeping the same sparsity ratio) throughout training.

A Theoretical Understanding of Neural Network Compression from Sparse Linear Approximation

This work proposes to use the sparsity-sensitive ℓq-norm to characterize compressibility and provide a relationship between soft sparsity of the weights in the network and the degree of compression with a controlled accuracy degradation bound, and develops adaptive algorithms for pruning each neuron in the network informed by the theory.
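
As a rough illustration of the soft-sparsity measure mentioned above, the sketch below computes a sparsity-sensitive ℓq quasi-norm with 0 < q < 1, which is small when the weight mass is concentrated on a few large entries. The choice of q and the example vectors are ours, not the paper's.

import numpy as np

def lq_norm(weights, q=0.5):
    # Sparsity-sensitive l_q quasi-norm for 0 < q < 1: smaller values indicate
    # that the energy is concentrated on few weights (higher "soft sparsity").
    w = np.abs(np.ravel(weights))
    return float(np.sum(w ** q) ** (1.0 / q))

# Two vectors with the same l2 norm but very different soft sparsity.
w_spread = np.full(1000, 1.0 / np.sqrt(1000))
w_concentrated = np.zeros(1000)
w_concentrated[:10] = 1.0 / np.sqrt(10)
print(lq_norm(w_spread), lq_norm(w_concentrated))   # the concentrated vector scores far lower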

Long Live the Lottery: The Existence of Winning Tickets in Lifelong Learning

This paper demonstrates for the first time that such extremely compact and independently trainable sub-networks can also be identified in the lifelong learning scenario, and introduces lottery teaching, which further overcomes forgetting via knowledge distillation aided by external unlabeled data.

How does unlabeled data improve generalization in self-training? A one-hidden-layer theoretical analysis

It is proved that iterative self-training converges linearly with both convergence rate and generalization accuracy improved in the order of 1/√M, where M is the number of unlabeled samples.
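
The iterative self-training loop analyzed in that work can be sketched as follows, using scikit-learn logistic regression as a stand-in for the one-hidden-layer network in the analysis; all names are illustrative. Each round pseudo-labels the unlabeled pool with the current model and refits on the union of labeled and pseudo-labeled data.

import numpy as np
from sklearn.linear_model import LogisticRegression

def iterative_self_training(X_lab, y_lab, X_unlab, n_rounds=5):
    # Fit on labeled data, pseudo-label the unlabeled pool, refit on the union, repeat.
    model = LogisticRegression().fit(X_lab, y_lab)
    for _ in range(n_rounds):
        pseudo_labels = model.predict(X_unlab)
        X_all = np.vstack([X_lab, X_unlab])
        y_all = np.concatenate([y_lab, pseudo_labels])
        model = LogisticRegression().fit(X_all, y_all)
    return model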

References

SHOWING 1-10 OF 75 REFERENCES

The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks

This work finds that dense, randomly-initialized, feed-forward networks contain subnetworks ("winning tickets") that - when trained in isolation - reach test accuracy comparable to the original network in a similar number of iterations, and articulate the "lottery ticket hypothesis".
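The iterative magnitude pruning (IMP) recipe articulated in this paper can be sketched roughly as below: train, prune the smallest surviving weights, rewind the remaining weights to their original initialization, and repeat. `train_fn` is a hypothetical training routine assumed to respect the binary mask; the per-round pruning fraction is illustrative.

import numpy as np

def iterative_magnitude_pruning(w_init, train_fn, rounds=5, prune_frac=0.2):
    # train_fn(weights, mask) is assumed to return trained weights of the same shape.
    mask = np.ones_like(w_init)
    for _ in range(rounds):
        w_trained = train_fn(w_init * mask, mask)
        surviving = np.abs(w_trained[mask == 1])
        threshold = np.quantile(surviving, prune_frac)    # drop the lowest prune_frac of survivors
        mask = mask * (np.abs(w_trained) >= threshold)
        # Rewind: the candidate winning ticket restarts from the original initialization.
    return w_init * mask, mask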

Stabilizing the Lottery Ticket Hypothesis

This paper modifies IMP to search for subnetworks that could have been obtained by pruning early in training rather than at iteration 0, and studies subnetwork "stability," finding that - as accuracy improves in this fashion - IMP subnetworks train to parameters closer to those of the full network and do so with improved consistency in the face of gradient noise.

Greedy Optimization Provably Wins the Lottery: Logarithmic Number of Winning Tickets is Enough

A greedy optimization based pruning method is proposed, with the guarantee that the discrepancy between the pruned network and the original network decays at an exponentially fast rate w.r.t. the size of the pruned network, under weak assumptions that apply to most practical settings.
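
A hedged sketch of greedy forward selection in this spirit is given below; it is not the paper's exact objective. Here the subnetwork is grown one hidden neuron at a time to fit a target output (for example, the dense network's output) by least squares, and all names are illustrative.

import numpy as np

def greedy_neuron_selection(H, y, k):
    # H: (n_samples, n_neurons) matrix of hidden-neuron outputs; y: target vector.
    # Greedily add the neuron whose inclusion most reduces the least-squares error.
    selected = []
    for _ in range(k):
        best_j, best_err = None, np.inf
        for j in range(H.shape[1]):
            if j in selected:
                continue
            cols = selected + [j]
            coef, *_ = np.linalg.lstsq(H[:, cols], y, rcond=None)
            err = np.linalg.norm(H[:, cols] @ coef - y)
            if err < best_err:
                best_j, best_err = j, err
        selected.append(best_j)
    return selected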

Picking Winning Tickets Before Training by Preserving Gradient Flow

This work argues that efficient training requires preserving the gradient flow through the network, and proposes a simple but effective pruning criterion called Gradient Signal Preservation (GraSP), which achieves significantly better performance than the baseline at extreme sparsity levels.
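A hedged PyTorch sketch of a GraSP-style score follows; the sign and selection conventions may differ from the published criterion. The Hessian-gradient product is obtained with one extra backward pass through the gradient norm.

import torch

def grasp_scores(model, loss_fn, x, y):
    # Score each parameter roughly by -theta * (H g), where g is the loss gradient
    # and H the Hessian, both taken at the current (e.g. initial) weights.
    params = [p for p in model.parameters() if p.requires_grad]
    loss = loss_fn(model(x), y)
    grads = torch.autograd.grad(loss, params, create_graph=True)
    dot = sum((g * g.detach()).sum() for g in grads)   # g^T stop_grad(g)
    hess_grad = torch.autograd.grad(dot, params)       # d(dot)/d(theta) ~= H g
    return [-p.detach() * hg for p, hg in zip(params, hess_grad)]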

Proving the Lottery Ticket Hypothesis: Pruning is All You Need

An even stronger hypothesis is proved, showing that for every bounded distribution and every target network with bounded weights, a sufficiently over-parameterized neural network with random weights contains a subnetwork with roughly the same accuracy as the target network, without any further training.

Drawing early-bird tickets: Towards more efficient training of deep networks

This paper discovers for the first time that winning tickets can be identified at the very early training stage, which it terms early-bird (EB) tickets, via low-cost training schemes at large learning rates, consistent with recently reported observations that the key connectivity patterns of neural networks emerge early.
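
A minimal sketch of the underlying idea, assuming a pruning mask is drawn at the end of each epoch: declare an early-bird ticket once recently drawn masks stop changing. The window size and tolerance below are illustrative, not the paper's values.

import numpy as np

def mask_distance(mask_a, mask_b):
    # Normalized Hamming distance between two binary pruning masks.
    return float(np.mean(mask_a != mask_b))

def found_early_bird(mask_history, window=5, tol=0.1):
    # True once all pairwise distances among the last `window` masks fall below `tol`.
    if len(mask_history) < window:
        return False
    recent = mask_history[-window:]
    return all(mask_distance(a, b) < tol
               for i, a in enumerate(recent) for b in recent[i + 1:])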

Logarithmic Pruning is All You Need

This work removes the most limiting assumptions of this previous work while providing significantly tighter bounds: the over-parameterized network only needs a logarithmic number of neurons per weight of the target subnetwork.

Rigging the Lottery: Making All Tickets Winners

This paper introduces a method to train sparse neural networks with a fixed parameter count and a fixed computational cost throughout training, without sacrificing accuracy relative to existing dense-to-sparse training methods.
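A hedged sketch of a RigL-style connectivity update at fixed sparsity: periodically drop the smallest-magnitude active weights and grow the same number of inactive connections with the largest gradient magnitude, so the parameter count never changes. Names and the update fraction are ours.

import numpy as np

def drop_and_grow(weights, grads, mask, update_frac=0.3):
    # weights, grads, mask share one shape; mask is binary and fixes the sparsity level.
    n_update = int(update_frac * mask.sum())
    w, g, m = weights.ravel(), grads.ravel(), mask.ravel().copy()
    active, inactive = np.flatnonzero(m == 1), np.flatnonzero(m == 0)

    drop = active[np.argsort(np.abs(w[active]))[:n_update]]       # weakest surviving weights
    grow = inactive[np.argsort(-np.abs(g[inactive]))[:n_update]]  # most promising dead weights

    m[drop] = 0
    m[grow] = 1
    w = w * m                                # newly grown connections start from zero
    return w.reshape(weights.shape), m.reshape(mask.shape)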

A Convergence Theory for Deep Learning via Over-Parameterization

This work proves why stochastic gradient descent can find global minima on the training objective of DNNs in polynomial time and implies an equivalence between over-parameterized neural networks and neural tangent kernel (NTK) in the finite (and polynomial) width setting.
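
For reference, the neural tangent kernel mentioned above is, informally, the kernel induced by the network's parameter gradients at initialization (notation ours, as a rough sketch): K_NTK(x, x′) = ⟨∇_θ f(x; θ₀), ∇_θ f(x′; θ₀)⟩. At sufficiently large (polynomial) width, training the over-parameterized network with SGD stays close to this kernel regime, which is, roughly, the sense of the equivalence stated above.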

Deconstructing Lottery Tickets: Zeros, Signs, and the Supermask

This paper studies the three critical components of the Lottery Ticket algorithm, showing that each may be varied significantly without impacting the overall results, and shows why setting weights to zero is important, how signs are all you need to make the reinitialized network train, and why masking behaves like training.
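
A minimal sketch of the "supermask" and sign observations above: apply a good binary mask to the untrained initialization, optionally keeping only the signs of the initial weights at a fixed magnitude. The constant-magnitude choice and names are illustrative.

import numpy as np

def supermasked_layer(w_init, supermask, magnitude=None):
    # Keep only the signs of the initial weights (at a shared magnitude) and zero out
    # everything outside the supermask; the surviving weights are not trained further.
    if magnitude is None:
        magnitude = float(np.std(w_init))   # illustrative choice of shared magnitude
    return supermask * np.sign(w_init) * magnitude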
...