Corpus ID: 203593172

Over-parameterization as a Catalyst for Better Generalization of Deep ReLU network

@article{Tian2019OverparameterizationAA,
  title={Over-parameterization as a Catalyst for Better Generalization of Deep ReLU network},
  author={Yuandong Tian},
  journal={ArXiv},
  year={2019},
  volume={abs/1909.13458}
}
To analyze deep ReLU networks, we adopt a student-teacher setting in which an over-parameterized student network learns from the output of a fixed teacher network of the same depth, with Stochastic Gradient Descent (SGD). Our contributions are two-fold. First, we prove that when the gradient is zero (or bounded above by a small constant) at every data point in training, a situation called the \emph{interpolation setting}, there exists a many-to-one \emph{alignment} between student and teacher nodes in…
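Below is a minimal sketch of the student-teacher setup described in the abstract, assuming PyTorch; the widths, depth, learning rate, and Gaussian input distribution are illustrative choices, not the paper's experimental settings.

```python
# Student-teacher training sketch (assumption: PyTorch; all sizes and
# hyperparameters are illustrative, not taken from the paper).
import torch
import torch.nn as nn

def mlp(widths):
    """Fully connected ReLU network with the given layer widths."""
    layers = []
    for d_in, d_out in zip(widths[:-1], widths[1:]):
        layers += [nn.Linear(d_in, d_out), nn.ReLU()]
    return nn.Sequential(*layers[:-1])  # drop the ReLU after the output layer

torch.manual_seed(0)
d, n = 20, 1024
teacher = mlp([d, 10, 10, 1])   # fixed, narrow teacher
student = mlp([d, 50, 50, 1])   # over-parameterized student of the same depth
for p in teacher.parameters():
    p.requires_grad_(False)

x = torch.randn(n, d)           # training inputs (Gaussian here is an assumption)
y = teacher(x)                  # labels are the fixed teacher's outputs

loader = torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(x, y), batch_size=64, shuffle=True)
opt = torch.optim.SGD(student.parameters(), lr=0.05)

for epoch in range(200):
    for xb, yb in loader:
        loss = ((student(xb) - yb) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
# In the interpolation setting, training drives the loss (and hence the
# gradient) close to zero at every training point.
```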
Sharp Rate of Convergence for Deep Neural Network Classifiers under the Teacher-Student Setting
TLDR
A teacher-student framework is proposed that assumes the Bayes classifier to be expressed as ReLU neural networks, and a sharp rate of convergence is obtained, i.e., $\tilde{O}_d(n^{-2/3})$, for classifiers trained using either 0-1 loss or hinge loss.
Optimal Rate of Convergence for Deep Neural Network Classifiers under the Teacher-Student Setting
TLDR
A teacher-student framework is proposed that assumes the Bayes classifier to be expressed as ReLU neural networks, and a sharp rate of convergence is obtained, i.e., $\tilde{O}_d(n^{-2/3})$ for classifiers trained using either 0-1 loss or hinge loss.
Self-Distillation: Towards Efficient and Compact Neural Networks.
TLDR
Self-distillation attaches several attention modules and shallow classifiers at different depths of a neural network and distills knowledge from the deepest classifier to the shallower classifiers, which allows the network to run in a dynamic manner and yields substantial acceleration.
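A hedged sketch of the self-distillation recipe summarized above, assuming PyTorch and omitting the attention modules; the backbone, heads, temperature, and loss weights are illustrative placeholders, not the authors' implementation.

```python
# Self-distillation sketch: classifier heads at several depths, with the
# deepest head acting as teacher for the shallower ones.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfDistillNet(nn.Module):
    def __init__(self, dim=64, num_classes=10, depth=4):
        super().__init__()
        self.blocks = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, dim), nn.ReLU()) for _ in range(depth)])
        # One classifier head per depth; the last one is the deepest "teacher".
        self.heads = nn.ModuleList(
            [nn.Linear(dim, num_classes) for _ in range(depth)])

    def forward(self, x):
        logits, h = [], x
        for block, head in zip(self.blocks, self.heads):
            h = block(h)
            logits.append(head(h))
        return logits  # shallow -> deep

def self_distillation_loss(logits, targets, T=3.0, alpha=0.5):
    """Hard-label loss on every head plus KL from the deepest head to shallower ones."""
    deepest = logits[-1].detach()
    loss = sum(F.cross_entropy(l, targets) for l in logits)
    for l in logits[:-1]:
        loss += alpha * T * T * F.kl_div(
            F.log_softmax(l / T, dim=1),
            F.softmax(deepest / T, dim=1),
            reduction="batchmean")
    return loss

net = SelfDistillNet()
x, y = torch.randn(8, 64), torch.randint(0, 10, (8,))
loss = self_distillation_loss(net(x), y)
# At inference, an early head can be used alone for faster (shallower) prediction.
```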
Implicit Regularization of Normalization Methods
TLDR
It is shown that WN and rPGD adaptively regularize the weights and converge close to the minimum $\ell_2$ norm solution even for initializations far from zero, unlike gradient descent, which only converges to the minimum norm solution when started at zero and is more sensitive to initialization.
The Most Used Activation Functions: Classic Versus Current
TLDR
This paper gives an overview of the most widely used activation functions, both classic and current, which are among the best-known activation functions in machine learning and deep learning research.
Implicit Regularization and Convergence for Weight Normalization
TLDR
WN and rPGD reparametrize the weights with a scale $g$ and a unit vector $w$, so the objective function becomes non-convex; it is shown that these methods adaptively regularize the weights and converge close to the minimum $\ell_2$ norm solution, even for initializations far from zero.
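The reparameterization described above can be sketched as follows, assuming PyTorch; `WeightNormLinear` is a hypothetical helper written for illustration, and PyTorch's built-in `torch.nn.utils.weight_norm` provides the same $w = g \cdot v / \|v\|$ parameterization.

```python
# Weight-normalization reparameterization sketch (assumption: PyTorch;
# a single linear layer, written out explicitly for illustration).
import torch
import torch.nn as nn

class WeightNormLinear(nn.Module):
    """Linear layer with weight w = g * v / ||v||, trained over (g, v) instead of w."""
    def __init__(self, d_in, d_out):
        super().__init__()
        self.v = nn.Parameter(torch.randn(d_out, d_in))
        self.g = nn.Parameter(torch.ones(d_out))
        self.b = nn.Parameter(torch.zeros(d_out))

    def forward(self, x):
        w = self.g.unsqueeze(1) * self.v / self.v.norm(dim=1, keepdim=True)
        return x @ w.t() + self.b

layer = WeightNormLinear(20, 5)
out = layer(torch.randn(3, 20))
# Equivalent built-in: torch.nn.utils.weight_norm(nn.Linear(20, 5))
```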
Recovery and Generalization in Over-Realized Dictionary Learning
TLDR
It is shown that an efficient and provably correct distillation mechanism can be employed to recover the correct atoms from the over-realized model, and that the meta-algorithm provides dictionary estimates with consistently better recovery of the ground-truth model.

References

Showing 1-10 of 75 references
Luck Matters: Understanding Training Dynamics of Deep ReLU Networks
TLDR
Using a teacher-student setting, a novel relationship between the gradient received by hidden student nodes and the activations of teacher nodes for deep ReLU networks is discovered and it is proved that student nodes whose weights are initialized to be close to teacher nodes converge to them at a faster rate.
A Convergence Theory for Deep Learning via Over-Parameterization
TLDR
This work proves why stochastic gradient descent can find global minima on the training objective of DNNs in $\textit{polynomial time}$ and implies an equivalence between over-parameterized neural networks and neural tangent kernel (NTK) in the finite (and polynomial) width setting.
Gradient descent optimizes over-parameterized deep ReLU networks
TLDR
The key idea of the proof is that Gaussian random initialization followed by gradient descent produces a sequence of iterates that stay inside a small perturbation region centered at the initial weights, in which the training loss function of the deep ReLU networks enjoys nice local curvature properties that ensure the global convergence of gradient descent.
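The "small perturbation region" behavior can be probed empirically by tracking how far the weights drift from their Gaussian initialization during training; the following is a minimal sketch, assuming PyTorch, with model and data sizes chosen only for illustration.

```python
# Track the distance of the weights from their random initialization
# (assumption: PyTorch; model and data sizes are illustrative).
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(20, 512), nn.ReLU(), nn.Linear(512, 1))
init = copy.deepcopy(model).state_dict()   # snapshot of the initial weights

x, y = torch.randn(256, 20), torch.randn(256, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.01)

for step in range(1000):
    loss = ((model(x) - y) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    if step % 200 == 0:
        with torch.no_grad():
            drift = torch.sqrt(sum((p - init[k]).pow(2).sum()
                                   for k, p in model.named_parameters()))
        print(f"step {step:4d}  loss {loss.item():.4f}  ||W - W0|| = {drift.item():.3f}")
```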
Dynamics of stochastic gradient descent for two-layer neural networks in the teacher–student setup
TLDR
The results indicate that achieving good generalisation in neural networks goes beyond the properties of SGD alone and depends on the interplay of at least the algorithm, the model architecture, and the data set.
An Improved Analysis of Training Over-parameterized Deep Neural Networks
TLDR
An improved analysis of the global convergence of (stochastic) gradient descent for training deep neural networks is provided, which requires only a milder over-parameterization condition than previous work in terms of the training sample size and other problem-dependent parameters.
The Power of Interpolation: Understanding the Effectiveness of SGD in Modern Over-parametrized Learning
TLDR
The key observation is that most modern learning architectures are over-parametrized and are trained to interpolate the data by driving the empirical loss close to zero, yet it is still unclear why these interpolated solutions perform well on test data.
Learning Overparameterized Neural Networks via Stochastic Gradient Descent on Structured Data
TLDR
It is proved that SGD learns a network with a small generalization error, albeit the network has enough capacity to fit arbitrary labels, when the data comes from mixtures of well-separated distributions.
Learning and Generalization in Overparameterized Neural Networks, Going Beyond Two Layers
TLDR
It is proved that overparameterized neural networks can learn some notable concept classes, including two- and three-layer networks with fewer parameters and smooth activations, via SGD (stochastic gradient descent) or its variants, in polynomial time and using polynomially many samples.
Recovery Guarantees for One-hidden-layer Neural Networks
TLDR
This work distills some properties of activation functions that lead to local strong convexity in the neighborhood of the ground-truth parameters for the 1NN (one-hidden-layer neural network) squared-loss objective, and provides recovery guarantees for 1NNs with both sample complexity and computational complexity $\mathit{linear}$ in the input dimension and $\mathit{logarithmic}$ in the precision.
Rethinking the Value of Network Pruning
TLDR
It is found that with optimal learning rate, the "winning ticket" initialization as used in Frankle & Carbin (2019) does not bring improvement over random initialization, and the need for more careful baseline evaluations in future research on structured pruning methods is suggested.