Corpus ID: 231839425

Do We Actually Need Dense Over-Parameterization? In-Time Over-Parameterization in Sparse Training

@inproceedings{liu2021itop,
  title={Do We Actually Need Dense Over-Parameterization? In-Time Over-Parameterization in Sparse Training},
  author={Shiwei Liu and Lu Yin and Decebal Constantin Mocanu and Mykola Pechenizkiy},
  booktitle={International Conference on Machine Learning},
  year={2021}
}
In this paper, we introduce a new perspective on training deep neural networks that achieves state-of-the-art performance without expensive dense over-parameterization, by proposing the concept of In-Time Over-Parameterization (ITOP) in sparse training. By starting from a random sparse network and continuously exploring sparse connectivities during training, we can perform over-parameterization in the space-time manifold, closing the gap in expressibility between sparse training…
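The connectivity exploration described above can be sketched as a prune-and-grow step applied periodically during training: the smallest-magnitude active weights are dropped and an equal number of previously inactive positions are activated, so the sparsity level stays fixed while the set of explored parameters grows over time. This is a minimal NumPy sketch under simplifying assumptions (random growth, per-tensor pruning); the function name and parameters are illustrative, not the authors' API.

```python
import numpy as np

def prune_and_grow(weights, mask, prune_frac=0.3, rng=None):
    """One exploration step of dynamic sparse training (ITOP-style sketch):
    drop the smallest-magnitude active weights, then grow the same number
    of new connections at random inactive positions, keeping sparsity fixed."""
    rng = rng or np.random.default_rng(0)
    active = np.flatnonzero(mask)
    n_prune = int(prune_frac * active.size)
    # Prune: deactivate the n_prune active weights with smallest magnitude.
    order = active[np.argsort(np.abs(weights.flat[active]))]
    mask.flat[order[:n_prune]] = 0
    weights.flat[order[:n_prune]] = 0.0
    # Grow: activate the same number of currently inactive positions.
    inactive = np.flatnonzero(mask == 0)
    grown = rng.choice(inactive, size=n_prune, replace=False)
    mask.flat[grown] = 1  # new weights start at zero, as in most DST methods
    return weights, mask
```

Repeating this step throughout training means that, cumulatively, far more parameters are "visited" than are active at any single moment, which is the in-time over-parameterization the abstract refers to.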


FreeTickets: Accurate, Robust and Efficient Deep Ensemble by Training with Dynamic Sparsity

This work introduces the FreeTickets concept as the first solution that can boost the performance of sparse convolutional neural networks over their dense equivalents by a large margin, while using only a fraction of the computational resources required by the latter for the complete training.

Chasing Sparsity in Vision Transformers: An End-to-End Exploration

This paper launches and reports the first-of-its-kind comprehensive exploration of integrating sparsity in ViTs “from end to end” by dynamically extracting and training sparse subnetworks while sticking to a fixed small parameter budget.

Search Spaces for Neural Model Training

This paper seeks to understand this behavior using search spaces: adding weights creates extra degrees of freedom that form new paths for optimization, rendering neural model training more effective. It also shows how to augment search spaces to train sparse models that attain competitive scores across dozens of deep learning workloads.

Sparse Training via Boosting Pruning Plasticity with Neuroregeneration

It is shown that pruning plasticity can be substantially improved by injecting a brain-inspired mechanism called neuroregeneration, i.e., regenerating the same number of connections as were pruned, in a manner that advances state-of-the-art pruning methods.

The Elastic Lottery Ticket Hypothesis

The Elastic Lottery Ticket Hypothesis (E-LTH) is articulated: by mindfully replicating and re-ordering layers of one network, its winning ticket can be stretched into a subnetwork for another, deeper network from the same model family, whose performance is nearly as competitive as that of the latter’s winning ticket found directly by IMP.

Double Dynamic Sparse Training for GANs

Double dynamic sparse training (DDST) is proposed, which automatically determines the density of the discriminator and greatly boosts the performance of sparse GANs on multiple datasets; it also introduces a quantity called the balance ratio (BR) to quantify the balance between the generator and the discriminator.

Calibrating the Rigged Lottery: Making All Tickets Reliable

A new sparse training method is proposed to produce sparse models with improved confidence calibration; it utilizes two masks, a deterministic mask and a random mask, and can be viewed as a hierarchical variational approximation of a probabilistic deep Gaussian process.

Balance is Essence: Accelerating Sparse Training via Adaptive Gradient Correction

To accelerate and stabilize the convergence of sparse training, the gradient changes are analyzed and an adaptive gradient correction method is developed that outperforms leading sparse training methods by up to 5.0% in accuracy given the same number of training epochs, and reduces the number of training epochs needed to achieve the same accuracy.

SparseVLR: A Novel Framework for Verified Locally Robust Sparse Neural Networks Search

SparseVLR is developed, a novel framework to search for verified locally robust sparse networks; it does not require a pre-trained dense model, reducing training time by 50%, and its accuracy and robustness are comparable to those of its dense counterparts.

Layer Freezing & Data Sieving: Missing Pieces of a Generic Framework for Sparse Training

This paper investigates the feasibility and potential of using the layer-freezing technique in sparse training and proposes a data-sieving method for dataset-efficient training, which further reduces training costs by ensuring that only a partial dataset is used throughout the entire training process; the resulting framework is dubbed SpFDE.



A Convergence Theory for Deep Learning via Over-Parameterization

This work proves why stochastic gradient descent can find global minima on the training objective of DNNs in polynomial time, and implies an equivalence between over-parameterized neural networks and the neural tangent kernel (NTK) in the finite (and polynomial) width setting.

Parameter Efficient Training of Deep Convolutional Neural Networks by Dynamic Sparse Reparameterization

This work suggests that exploring structural degrees of freedom during training is more effective than adding extra parameters to the network, and outperforms previous static and dynamic reparameterization methods, yielding the best accuracy for a fixed parameter budget.

Top-KAST: Top-K Always Sparse Training

This work proposes Top-KAST, a method that preserves constant sparsity throughout training (in both the forward and backward passes), and demonstrates its efficacy by showing that it performs comparably to or better than previous works when training models on the established ImageNet benchmark, while fully maintaining sparsity.
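The core selection step behind Top-KAST-style methods can be illustrated with a per-tensor top-K magnitude mask: only the largest-magnitude fraction of weights is kept active in the forward pass. This is a minimal sketch of that selection only (the full method also maintains a larger backward set and per-layer budgets); the function name and `density` parameter are illustrative.

```python
import numpy as np

def topk_mask(weights, density):
    """Keep only the largest-magnitude fraction `density` of a weight
    tensor (per-tensor top-K selection, as used in the forward pass of
    top-K sparse training schemes). Returns the masked weights and mask."""
    k = max(1, int(density * weights.size))
    flat = np.abs(weights).ravel()
    # Threshold at the k-th largest magnitude.
    thresh = np.partition(flat, -k)[-k]
    mask = (np.abs(weights) >= thresh).astype(weights.dtype)
    return weights * mask, mask
```

Recomputing this mask as the weights evolve keeps the active set constant in size while letting its membership change, which is how such methods stay sparse in both passes throughout training.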

An Improved Analysis of Training Over-parameterized Deep Neural Networks

An improved analysis of the global convergence of (stochastic) gradient descent for training deep neural networks, which only requires a milder over-parameterization condition than previous work in terms of the training sample size and other problem-dependent parameters is provided.

The State of Sparsity in Deep Neural Networks

It is shown that unstructured sparse architectures learned through pruning cannot be trained from scratch to the same test set performance as a model trained with joint sparsification and optimization, and the need for large-scale benchmarks in the field of model compression is highlighted.

Rigging the Lottery: Making All Tickets Winners

This paper introduces a method to train sparse neural networks with a fixed parameter count and a fixed computational cost throughout training, without sacrificing accuracy relative to existing dense-to-sparse training methods.

Selfish Sparse RNN Training

This paper proposes SNT-ASGD, a novel variant of the averaged stochastic gradient optimizer, which significantly improves the performance of all sparse training methods for RNNs and achieves state-of-the-art sparse training results, better than dense-to-sparse methods.

The Difficulty of Training Sparse Neural Networks

It is shown that, despite the failure of optimizers, there is a linear path with a monotonically decreasing objective from the initialization to the "good" solution, and traversing extra dimensions may be needed to escape stationary points found in the sparse subspace.

SGD Learns Over-parameterized Networks that Provably Generalize on Linearly Separable Data

This work proves convergence rates of SGD to a global minimum and provides generalization guarantees for this global minimum that are independent of the network size, and shows that SGD can avoid overfitting despite the high capacity of the model.

Sparse Networks from Scratch: Faster Training without Losing Performance

This work develops sparse momentum, an algorithm which uses exponentially smoothed gradients (momentum) to identify layers and weights which reduce the error efficiently and shows that the benefits of momentum redistribution and growth increase with the depth and size of the network.