Over-parameterization as a Catalyst for Better Generalization of Deep ReLU network
@article{Tian2019OverparameterizationAA,
  title   = {Over-parameterization as a Catalyst for Better Generalization of Deep ReLU network},
  author  = {Yuandong Tian},
  journal = {ArXiv},
  year    = {2019},
  volume  = {abs/1909.13458}
}
To analyze deep ReLU networks, we adopt a student-teacher setting in which an over-parameterized student network learns from the output of a fixed teacher network of the same depth, using Stochastic Gradient Descent (SGD). Our contributions are two-fold. First, we prove that when the gradient is zero (or bounded above by a small constant) at every training data point, a situation called the \emph{interpolation setting}, there exists a many-to-one \emph{alignment} between student and teacher nodes in…
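To make the setup concrete, here is a minimal sketch of the student-teacher training described in the abstract, written in PyTorch. The widths, depth, learning rate, and Gaussian input distribution are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

def make_relu_net(widths):
    """Stack Linear + ReLU layers; the output layer stays linear."""
    layers = []
    for d_in, d_out in zip(widths[:-1], widths[1:]):
        layers += [nn.Linear(d_in, d_out), nn.ReLU()]
    return nn.Sequential(*layers[:-1])  # drop the trailing ReLU

d = 20
teacher = make_relu_net([d, 10, 10, 1])    # fixed teacher
student = make_relu_net([d, 50, 50, 1])    # over-parameterized student of the same depth
for p in teacher.parameters():
    p.requires_grad_(False)

opt = torch.optim.SGD(student.parameters(), lr=0.01)
for step in range(5000):
    x = torch.randn(64, d)                 # assumed input distribution
    y = teacher(x)                         # the student only sees the teacher's outputs
    loss = ((student(x) - y) ** 2).mean()  # regression onto teacher outputs
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The interpolation setting in the abstract corresponds to this loss (and its gradient) being driven to zero, or below a small constant, on every training point.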
7 Citations
Sharp Rate of Convergence for Deep Neural Network Classifiers under the Teacher-Student Setting
- Computer Science
- 2020
A teacher-student framework is proposed that assumes the Bayes classifier can be expressed as a ReLU neural network, and a sharp rate of convergence, $\tilde{O}_d(n^{-2/3})$, is obtained for classifiers trained using either 0-1 loss or hinge loss.
Optimal Rate of Convergence for Deep Neural Network Classifiers under the Teacher-Student Setting
- Computer Science, ArXiv
- 2020
A teacher-student framework is proposed that assumes the Bayes classifier can be expressed as a ReLU neural network, and a sharp rate of convergence is obtained, i.e., $\tilde{O}_d(n^{-2/3})$, for classifiers trained using either 0-1 loss or hinge loss.
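As a hedged reading of the rate quoted in these two entries (generic excess-risk notation, not necessarily the papers' exact statement), the bound says the trained classifier $\hat f_n$ approaches the Bayes classifier $f^*$ at rate

```latex
% Excess misclassification risk; constants and log factors hidden in \tilde{O}_d
\mathbb{E}\big[R(\hat f_n)\big] - R(f^*) \;=\; \tilde{O}_d\!\big(n^{-2/3}\big),
\qquad R(f) = \Pr\big(f(X) \neq Y\big).
```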
Self-Distillation: Towards Efficient and Compact Neural Networks.
- Computer Science, IEEE Transactions on Pattern Analysis and Machine Intelligence
- 2021
Self-distillation attaches several attention modules and shallow classifiers at different depths of a neural network and distills knowledge from the deepest classifier to the shallower classifiers, which allows the network to operate dynamically and yields substantial acceleration.
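A minimal sketch of the self-distillation recipe summarized above, assuming a backbone with intermediate exits and a softened-logits KL distillation term; the module layout, temperature, and weighting are illustrative choices, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfDistilledNet(nn.Module):
    """Backbone with shallow exit classifiers attached at intermediate depths."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.block1 = nn.Sequential(nn.Linear(32, 64), nn.ReLU())
        self.block2 = nn.Sequential(nn.Linear(64, 64), nn.ReLU())
        self.block3 = nn.Sequential(nn.Linear(64, 64), nn.ReLU())
        self.exit1 = nn.Linear(64, num_classes)   # shallow classifier
        self.exit2 = nn.Linear(64, num_classes)   # shallow classifier
        self.exit3 = nn.Linear(64, num_classes)   # deepest classifier (teacher role)

    def forward(self, x):
        h1 = self.block1(x)
        h2 = self.block2(h1)
        h3 = self.block3(h2)
        return self.exit1(h1), self.exit2(h2), self.exit3(h3)

def self_distillation_loss(logits_shallow, logits_deep, targets, T=3.0, alpha=0.5):
    """Cross-entropy on labels plus KL from the deepest classifier's softened outputs."""
    ce = sum(F.cross_entropy(z, targets) for z in (*logits_shallow, logits_deep))
    kd = sum(
        F.kl_div(F.log_softmax(z / T, dim=1),
                 F.softmax(logits_deep.detach() / T, dim=1),
                 reduction="batchmean") * T * T
        for z in logits_shallow
    )
    return (1 - alpha) * ce + alpha * kd

# Usage sketch
model = SelfDistilledNet()
x, y = torch.randn(8, 32), torch.randint(0, 10, (8,))
z1, z2, z3 = model(x)
loss = self_distillation_loss([z1, z2], z3, y)
loss.backward()
```

In this sketch, an earlier exit can be used on its own at inference time, which is where the dynamic operation and acceleration mentioned above come from.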
Implicit Regularization of Normalization Methods
- Computer Science, ArXiv
- 2019
It is shown that WN and rPGD adaptively regularize the weights and converge to the minimum $\ell_2$ norm solution even for initializations far from zero, unlike gradient descent, which only converges to the minimum norm solution when started at zero and is therefore more sensitive to initialization.
The Most Used Activation Functions: Classic Versus Current
- Computer Science, 2020 International Conference on Development and Application Systems (DAS)
- 2020
This paper surveys the most widely used activation functions, both classic and current, which are among the best-known activation functions in machine learning and deep learning research.
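For reference, a short illustrative sketch (my own, not code from the survey) of a few classic and current activation functions:

```python
import torch

def sigmoid(x):    return torch.sigmoid(x)          # classic
def tanh(x):       return torch.tanh(x)             # classic
def relu(x):       return torch.relu(x)             # classic/current workhorse
def leaky_relu(x): return torch.where(x > 0, x, 0.01 * x)
def swish(x):      return x * torch.sigmoid(x)      # a.k.a. SiLU
def gelu(x):       return torch.nn.functional.gelu(x)

x = torch.linspace(-3, 3, 7)
for f in (sigmoid, tanh, relu, leaky_relu, swish, gelu):
    print(f.__name__, f(x))
```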
Implicit Regularization and Convergence for Weight Normalization
- Computer Science, NeurIPS
- 2020
WN and rPGD reparametrize the weights with a scale $g$ and a unit vector $w$, so the objective function becomes non-convex; it is shown that these methods adaptively regularize the weights and converge close to the minimum $\ell_2$ norm solution, even for initializations far from zero.
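A minimal sketch of the scale/direction reparametrization described above, applied to overparameterized linear regression (my illustrative choice); the effective weight is $w_{\mathrm{eff}} = g\, w / \lVert w \rVert$:

```python
import torch

# Least-squares objective under the weight-normalization reparametrization.
# Optimizing (g, w) instead of w_eff makes the problem non-convex.
torch.manual_seed(0)
n, d = 20, 50                                # fewer samples than dimensions (overparameterized)
X, y = torch.randn(n, d), torch.randn(n)

g = torch.tensor(1.0, requires_grad=True)    # scale parameter
w = torch.randn(d, requires_grad=True)       # direction parameter

opt = torch.optim.SGD([g, w], lr=0.05)
for _ in range(5000):
    w_eff = g * w / w.norm()
    loss = ((X @ w_eff - y) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# Compare to the minimum-l2-norm interpolant X^+ y
w_min = torch.linalg.pinv(X) @ y
print("||w_eff - w_min|| =", (g * w / w.norm() - w_min).norm().item())
```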
Recovery and Generalization in Over-Realized Dictionary Learning
- Computer Science, ArXiv
- 2020
It is shown that an efficient and provably correct distillation mechanism can be employed to recover the correct atoms from the over-realized model, and that the meta-algorithm provides dictionary estimates with consistently better recovery of the ground-truth model.
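To fix notation for the over-realized setting (a generic sparse-coding formulation, not necessarily the paper's exact one):

```latex
% Data from a ground-truth dictionary D^* with k-sparse codes x^*;
% the over-realized model fits a wider dictionary D with m > m^* atoms.
y = D^* x^*, \quad D^* \in \mathbb{R}^{d \times m^*}, \qquad
\min_{D \in \mathbb{R}^{d \times m},\; \{x_i\}} \; \frac{1}{n} \sum_{i=1}^{n} \lVert y_i - D x_i \rVert_2^2
\;\; \text{s.t.} \;\; \lVert x_i \rVert_0 \le k, \quad m > m^*.
```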
References
Showing 1-10 of 75 references
Luck Matters: Understanding Training Dynamics of Deep ReLU Networks
- Computer Science, ArXiv
- 2019
Using a teacher-student setting, a novel relationship between the gradient received by hidden student nodes and the activations of teacher nodes in deep ReLU networks is discovered, and it is proved that student nodes whose weights are initialized close to teacher nodes converge to them at a faster rate.
A Convergence Theory for Deep Learning via Over-Parameterization
- Computer Science, ICML
- 2019
This work proves why stochastic gradient descent can find global minima on the training objective of DNNs in $\textit{polynomial time}$ and implies an equivalence between over-parameterized neural networks and neural tangent kernel (NTK) in the finite (and polynomial) width setting.
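For context on the NTK equivalence mentioned above, the neural tangent kernel of a network $f(x;\theta)$ is the inner product of parameter gradients (standard definition, stated here for reference):

```latex
% Neural tangent kernel at parameters \theta
\Theta_\theta(x, x') \;=\; \big\langle \nabla_\theta f(x;\theta),\, \nabla_\theta f(x';\theta) \big\rangle .
```

Roughly, in the finite but polynomially wide regime, training with (stochastic) gradient descent stays close to the kernel predictor induced by $\Theta_{\theta_0}$ at random initialization.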
Gradient descent optimizes over-parameterized deep ReLU networks
- Computer Science, Machine Learning
- 2019
The key idea of the proof is that Gaussian random initialization followed by gradient descent produces a sequence of iterates that stay inside a small perturbation region centered at the initial weights, in which the training loss function of the deep ReLU networks enjoys nice local curvature properties that ensure the global convergence of gradient descent.
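A schematic version of the proof idea above, in generic notation (the radius, step size, and curvature constant are placeholders, not the paper's quantities):

```latex
% Iterates stay in a small ball around the random initialization W_0, where the
% loss has benign curvature, yielding linear convergence of the training loss.
\lVert W_t - W_0 \rVert_F \le \tau \ \ \text{for all } t,
\qquad
L(W_{t+1}) \le (1 - \eta\,\mu)\, L(W_t) \ \ \text{while } W_t \in B(W_0, \tau).
```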
Dynamics of stochastic gradient descent for two-layer neural networks in the teacher–student setup
- Computer Science, NeurIPS
- 2019
The results indicate that achieving good generalisation in neural networks goes beyond the properties of SGD alone and depends on the interplay of at least the algorithm, the model architecture, and the data set.
An Improved Analysis of Training Over-parameterized Deep Neural Networks
- Computer Science, NeurIPS
- 2019
An improved analysis of the global convergence of (stochastic) gradient descent for training deep neural networks is provided, requiring a milder over-parameterization condition than previous work in terms of the training sample size and other problem-dependent parameters.
The Power of Interpolation: Understanding the Effectiveness of SGD in Modern Over-parametrized Learning
- Computer Science, ICML
- 2018
The key observation is that most modern learning architectures are over-parametrized and are trained to interpolate the data by driving the empirical loss close to zero, yet it is still unclear why these interpolated solutions perform well on test data.
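In this context, interpolation means the trained model (nearly) fits every training point; in generic notation:

```latex
% Interpolation regime: empirical loss driven (close) to zero on all n training points
\hat{L}(\theta) \;=\; \frac{1}{n} \sum_{i=1}^{n} \ell\big(f(x_i;\theta),\, y_i\big) \;\approx\; 0 .
```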
Learning Overparameterized Neural Networks via Stochastic Gradient Descent on Structured Data
- Computer Science, NeurIPS
- 2018
It is proved that SGD learns a network with a small generalization error, even though the network has enough capacity to fit arbitrary labels, when the data comes from mixtures of well-separated distributions.
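A small sketch of the kind of structured data assumed above: a mixture of well-separated clusters, with the separation, dimensions, and label rule chosen by me for illustration.

```python
import torch

def mixture_data(n, d=10, k=4, separation=10.0):
    """Sample n points from k well-separated Gaussian clusters; labels follow the cluster."""
    centers = separation * torch.randn(k, d)   # widely spread cluster centers
    cluster = torch.randint(0, k, (n,))
    x = centers[cluster] + torch.randn(n, d)   # unit-variance noise around each center
    y = cluster % 2                            # binary labels determined by the cluster
    return x, y

x, y = mixture_data(1000)
print(x.shape, y.float().mean())
```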
Learning and Generalization in Overparameterized Neural Networks, Going Beyond Two Layers
- Computer Science, NeurIPS
- 2019
It is proved that overparameterized neural networks can learn some notable concept classes, including two- and three-layer networks with fewer parameters and smooth activations, via SGD (stochastic gradient descent) or its variants in polynomial time using polynomially many samples.
Recovery Guarantees for One-hidden-layer Neural Networks
- Computer Science, ICML
- 2017
This work distills some properties of activation functions that lead to local strong convexity in the neighborhood of the ground-truth parameters for the 1NN squared-loss objective, and provides recovery guarantees for 1NNs with both sample complexity and computational complexity $\mathit{linear}$ in the input dimension and $\mathit{logarithmic}$ in the precision.
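For reference, a one-hidden-layer model and its squared-loss objective in generic notation (my formulation, with activation $\sigma$ and $K$ hidden units):

```latex
% One-hidden-layer network with K hidden units and squared-loss objective
f(x; W, v) \;=\; \sum_{j=1}^{K} v_j \,\sigma\!\big(w_j^\top x\big),
\qquad
\min_{W, v}\; \frac{1}{n} \sum_{i=1}^{n} \big( f(x_i; W, v) - y_i \big)^2 .
```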
Rethinking the Value of Network Pruning
- Computer Science, ICLR
- 2019
It is found that, with an optimal learning rate, the "winning ticket" initialization used in Frankle & Carbin (2019) does not bring improvement over random initialization, and more careful baseline evaluations are suggested for future research on structured pruning methods.