• Corpus ID: 231839425

# Do We Actually Need Dense Over-Parameterization? In-Time Over-Parameterization in Sparse Training

@inproceedings{Liu2021DoWA,
title={Do We Actually Need Dense Over-Parameterization? In-Time Over-Parameterization in Sparse Training},
author={Shiwei Liu and Lu Yin and Decebal Constantin Mocanu and Mykola Pechenizkiy},
booktitle={International Conference on Machine Learning},
year={2021}
}
• Published in International Conference on Machine Learning, 4 February 2021
• Computer Science
In this paper, we introduce a new perspective on training deep neural networks capable of state-of-the-art performance without the need for the expensive over-parameterization by proposing the concept of In-Time Over-Parameterization (ITOP) in sparse training. By starting from a random sparse network and continuously exploring sparse connectivities during training, we can perform an Over-Parameterization in the space-time manifold, closing the gap in the expressibility between sparse training…
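The "continuously exploring sparse connectivities" step can be pictured as a prune-and-regrow update on a fixed-density mask: periodically drop the lowest-magnitude active weights and activate an equal number of new positions, so that far more parameters are visited over training than are ever active at once. The sketch below is a minimal NumPy illustration of that mechanism under assumed settings (function name, drop fraction, layer shape, random regrowth); it is not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def prune_and_regrow(weights, mask, drop_frac=0.3):
    """One connectivity-exploration step: drop the smallest-magnitude active
    weights and regrow the same number of connections at random inactive
    positions, keeping the total density constant."""
    active = np.flatnonzero(mask)
    n_drop = int(drop_frac * active.size)

    # Prune: deactivate the active weights with the smallest magnitude.
    drop = active[np.argsort(np.abs(weights.flat[active]))[:n_drop]]
    mask.flat[drop] = False
    weights.flat[drop] = 0.0

    # Grow: activate an equal number of currently inactive positions.
    inactive = np.flatnonzero(~mask)
    grow = rng.choice(inactive, size=n_drop, replace=False)
    mask.flat[grow] = True  # regrown weights start at zero
    return weights, mask

# Toy usage: a 256x128 layer at 10% density, exploring connectivity each "epoch".
w = rng.standard_normal((256, 128))
m = rng.random((256, 128)) < 0.1
w *= m
explored = m.copy()  # union of all positions ever activated
for _ in range(20):
    w, m = prune_and_regrow(w, m)
    explored |= m
print(f"active now: {m.mean():.2%}, explored so far: {explored.mean():.2%}")
```

In this toy run the `explored` fraction grows well beyond the 10% active density; that accumulation over training steps is the space-time sense of over-parameterization the abstract refers to.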
## 63 Citations

• Computer Science
ArXiv
• 2021
This work introduces the FreeTickets concept as the first solution that can boost the performance of sparse convolutional neural networks over their dense equivalents by a large margin, while using only a fraction of the computational resources required by the latter for complete training.
• Computer Science
NeurIPS
• 2021
This paper launches and reports a first-of-its-kind comprehensive exploration of a unified approach to integrating sparsity in ViTs "from end to end" by dynamically extracting and training sparse subnetworks while sticking to a fixed, small parameter budget.
• Computer Science
ArXiv
• 2021
This paper seeks to understand this behavior through the lens of search spaces: adding weights creates extra degrees of freedom that form new optimization paths, making neural model training more effective. It shows how to augment search spaces to train sparse models that attain competitive scores across dozens of deep learning workloads.
• Computer Science
NeurIPS
• 2021
It is shown that pruning plasticity can be substantially improved by injecting a brain-inspired mechanism called neuroregeneration, i.e., regenerating the same number of connections as are pruned, in a manner that advances state-of-the-art pruning methods.
• Computer Science
NeurIPS
• 2021
The Elastic Lottery Ticket Hypothesis (E-LTH) is articulated: by mindfully replicating and re-ordering layers of one network, its winning ticket can be stretched into a subnetwork for another, deeper network from the same model family, whose performance is nearly as competitive as that of the latter's winning ticket found directly by IMP.
• Computer Science
ArXiv
• 2023
A quantity called the balance ratio (BR) is introduced to quantify the balance between the generator and the discriminator, and double dynamic sparse training (DDST) is proposed, which automatically determines the density of the discriminator and greatly boosts the performance of sparse GANs on multiple datasets.
• Computer Science
ArXiv
• 2023
A new sparse training method is proposed to produce sparse models with improved confidence calibration; it utilizes two masks, a deterministic mask and a random mask, and can be viewed as a hierarchical variational approximation of a probabilistic deep Gaussian process.
• Computer Science
ArXiv
• 2023
To accelerate and stabilize the convergence of sparse training, the gradient changes are analyzed and an adaptive gradient correction method is developed; it outperforms leading sparse training methods by up to 5.0% in accuracy given the same number of training epochs and reduces the number of training epochs needed to reach the same accuracy.
• Computer Science
ArXiv
• 2022
SparseVLR is developed, a novel framework to search for verified locally robust sparse networks. It does not require a pre-trained dense model, reducing training time by 50%, and its accuracy and robustness are comparable to those of its dense counterparts.
• Computer Science
ArXiv
• 2022
This paper investigates the feasibility and potential of using the layer freezing technique in sparse training and proposes a data sieving method for dataset-efficient training, which further reduces training costs by ensuring that only a partial dataset is used throughout the entire training process; the resulting framework is dubbed SpFDE.

## References

Showing 1-10 of 73 references

• Computer Science
ICML
• 2019
This work proves why stochastic gradient descent can find global minima on the training objective of DNNs in *polynomial time*, and implies an equivalence between over-parameterized neural networks and the neural tangent kernel (NTK) in the finite (and polynomial) width setting.
• Computer Science
ICML
• 2019
This work suggests that exploring structural degrees of freedom during training is more effective than adding extra parameters to the network; the proposed method outperforms previous static and dynamic reparameterization methods, yielding the best accuracy for a fixed parameter budget.
• Computer Science
NeurIPS
• 2020
This work proposes Top-KAST, a method that preserves constant sparsity throughout training (in both the forward and backward passes), and demonstrates its efficacy by showing that it performs comparably to or better than previous work when training models on the established ImageNet benchmark, while fully maintaining sparsity.
• Computer Science
NeurIPS
• 2019
An improved analysis of the global convergence of (stochastic) gradient descent for training deep neural networks is provided, which requires only a milder over-parameterization condition than previous work in terms of the training sample size and other problem-dependent parameters.
• Computer Science
ArXiv
• 2019
It is shown that unstructured sparse architectures learned through pruning cannot be trained from scratch to the same test set performance as a model trained with joint sparsification and optimization, and the need for large-scale benchmarks in the field of model compression is highlighted.
• Computer Science
ICML
• 2020
This paper introduces a method to train sparse neural networks with a fixed parameter count and a fixed computational cost throughout training, without sacrificing accuracy relative to existing dense-to-sparse training methods.
• Computer Science
ICML
• 2021
This paper proposes SNT-ASGD, a novel variant of the averaged stochastic gradient optimizer, which significantly improves the performance of all sparse training methods for RNNs and achieves state-of-the-art sparse training results that surpass those of dense-to-sparse methods.
• Computer Science
ArXiv
• 2019
It is shown that, despite the failure of optimizers, there is a linear path with a monotonically decreasing objective from the initialization to the "good" solution, and traversing extra dimensions may be needed to escape stationary points found in the sparse subspace.
• Computer Science
ICLR
• 2018
This work proves convergence rates of SGD to a global minimum and provides generalization guarantees for this global minimum that are independent of the network size, and shows that SGD can avoid overfitting despite the high capacity of the model.
• Computer Science
ArXiv
• 2019
This work develops sparse momentum, an algorithm that uses exponentially smoothed gradients (momentum) to identify layers and weights which efficiently reduce the error, and shows that the benefits of momentum redistribution and growth increase with the depth and size of the network.
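The sparse momentum growth rule described above can be sketched as selecting, among currently inactive positions, those with the largest smoothed-gradient magnitude. The snippet below is a minimal NumPy illustration under assumed names, shapes, and smoothing factor; it is not the authors' implementation.

```python
import numpy as np

def momentum_growth(momentum, mask, n_grow):
    """Pick the n_grow inactive positions whose exponentially smoothed
    gradient (momentum) has the largest magnitude; these are the
    connections regrown after pruning."""
    inactive = np.flatnonzero(~mask)
    order = np.argsort(-np.abs(momentum.flat[inactive]))  # descending by |momentum|
    grow = inactive[order[:n_grow]]
    new_mask = mask.copy()
    new_mask.flat[grow] = True
    return new_mask

# Toy usage with an assumed smoothing factor beta = 0.9.
rng = np.random.default_rng(1)
grads = rng.standard_normal((64, 32))
momentum = np.zeros((64, 32))
momentum = 0.9 * momentum + (1 - 0.9) * grads  # one exponential smoothing step
mask = rng.random((64, 32)) < 0.2
mask = momentum_growth(momentum, mask, n_grow=50)
```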