# When will gradient methods converge to max-margin classifier under ReLU models?

@article{Xu2018WhenWG,
  title={When will gradient methods converge to max-margin classifier under ReLU models?},
  author={Tengyu Xu and Yi Zhou and Kaiyi Ji and Yingbin Liang},
  journal={arXiv: Learning},
  year={2018}
}
We study the implicit bias of gradient descent methods in solving a binary classification problem over a linearly separable dataset. The classifier is described by a nonlinear ReLU model and the objective function adopts the exponential loss function. We first characterize the landscape of the loss function and show that there can exist spurious asymptotic local minima besides asymptotic global minima. We then show that gradient descent (GD) can converge to either a global or a local max-margin…
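For intuition, the implicit-bias phenomenon the abstract describes can be reproduced in its simplest form. The sketch below is illustrative only: it replaces the paper's ReLU model with a plain linear predictor (the simplest homogeneous case in which the max-margin bias appears), and the data, step size, and iteration count are made-up demo values, not the paper's setup. It runs gradient descent on the exponential loss over linearly separable data and tracks the normalized margin, which the implicit-bias literature shows approaches the max-margin value.

```python
import numpy as np

# Illustrative sketch only (not the paper's ReLU experiment): gradient descent
# on the exponential loss with a homogeneous linear predictor f(w; x) = <w, x>.
# The data, step size, and iteration count are assumed demo values.
rng = np.random.default_rng(0)
n, d = 50, 2
X = np.vstack([
    rng.normal(loc=[2.0, 0.0], scale=0.5, size=(n, d)),   # positive cluster
    rng.normal(loc=[-2.0, 0.0], scale=0.5, size=(n, d)),  # negative cluster
])
y = np.concatenate([np.ones(n), -np.ones(n)])  # linearly separable labels

w = np.zeros(d)
lr = 0.1
margins = []
for t in range(2000):
    losses = np.exp(-y * (X @ w))      # exponential loss per sample
    grad = -(losses * y) @ X / len(y)  # gradient of the mean loss
    w -= lr * grad
    # normalized margin min_i y_i <w, x_i> / ||w||: the quantity whose limit
    # the implicit-bias results characterize
    margins.append(np.min(y * (X @ w)) / np.linalg.norm(w))
```

Here `||w||` diverges while the loss goes to zero, so only the direction `w / ||w||` carries information; tracking the normalized margin is how these papers make the bias visible.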
#### 14 Citations
Gradient Descent Maximizes the Margin of Homogeneous Neural Networks
• Computer Science, Mathematics
• ICLR
• 2020
The implicit regularization of the gradient descent algorithm in homogeneous neural networks, including fully-connected and convolutional neural networks with ReLU or LeakyReLU activations, is studied, and it is proved that both the normalized margin and its smoothed version converge to the objective value at a KKT point of the optimization problem.
Convergence of Gradient Descent on Separable Data
• Mathematics, Computer Science
• AISTATS
• 2019
It is proved that the convergence rate can be improved to $\log(t)/\sqrt{t}$ by using aggressive step sizes that compensate for the rapidly vanishing gradients, which might be useful for deep networks.
Stochastic Gradient Descent on Separable Data: Exact Convergence with a Fixed Learning Rate
• Mathematics, Computer Science
• AISTATS
• 2019
It is proved that SGD converges to zero loss, even with a fixed (non-vanishing) learning rate, in the special case of homogeneous linear classifiers with smooth monotone loss functions, optimized on linearly separable data.
Learning ReLU Networks on Linearly Separable Data: Algorithm, Optimality, and Generalization
• Mathematics, Computer Science
• IEEE Transactions on Signal Processing
• 2019
This paper presents a novel stochastic gradient descent (SGD) algorithm, which can provably train any single-hidden-layer ReLU network to attain global optimality, despite the presence of infinitely many bad local minima, maxima, and saddle points in general.
The Implicit Regularization for Adaptive Optimization Algorithms on Homogeneous Neural Networks
• Bohan Wang, Qi Meng, Wei Chen, Tie-Yan Liu
Despite their overwhelming capacity to overfit, deep neural networks trained by specific optimization algorithms tend to generalize well to unseen data. Recently, researchers explained it by…
Lexicographic and Depth-Sensitive Margins in Homogeneous and Non-Homogeneous Deep Models
• Mathematics, Computer Science
• ICML
• 2019
The limit of loss minimization with a diverging norm constraint is studied, related to the limit of a "margin path," and the resulting solution is characterized, showing convergence to a "lexicographic max-margin solution" in both homogeneous and non-homogeneous models.
Dynamics and Neural Collapse in Deep Classifiers trained with the Square Loss
Recent results suggest that square loss performs on par with cross-entropy loss in classification tasks for deep networks. While the theoretical understanding of training deep networks with the…
Understanding Estimation and Generalization Error of Generative Adversarial Networks
• Computer Science
• IEEE Transactions on Information Theory
• 2021
An upper bound as well as a minimax lower bound on the estimation error for training GANs are developed, which justifies the generalization ability of GAN training via SGM after multiple passes over the data and reflects the interplay between the discriminator and the generator.
Sampling Bias in Deep Active Classification: An Empirical Study
• Computer Science
• EMNLP/IJCNLP
• 2019
This work demonstrates that active set selection using the posterior entropy of deep models like FastText.zip (FTZ) is robust to sampling biases and to various algorithmic choices (query size and strategies), contrary to what traditional literature suggests, and proposes a simple baseline for deep active text classification that outperforms the state of the art.

#### References

Showing 1–10 of 23 references
The Implicit Bias of Gradient Descent on Separable Data
• Mathematics, Computer Science
• J. Mach. Learn. Res.
• 2018
We examine gradient descent on unregularized logistic regression problems, with homogeneous linear predictors on linearly separable datasets. We show the predictor converges to the direction of the…
Convergence of Gradient Descent on Separable Data
• Mathematics, Computer Science
• AISTATS
• 2019
It is proved that the convergence rate can be improved to $\log(t)/\sqrt{t}$ by using aggressive step sizes that compensate for the rapidly vanishing gradients, which might be useful for deep networks.
Stochastic Gradient Descent on Separable Data: Exact Convergence with a Fixed Learning Rate
• Mathematics, Computer Science
• AISTATS
• 2019
It is proved that SGD converges to zero loss, even with a fixed (non-vanishing) learning rate, in the special case of homogeneous linear classifiers with smooth monotone loss functions, optimized on linearly separable data.
Adaptivity of averaged stochastic gradient descent to local strong convexity for logistic regression
• F. Bach
• Mathematics, Computer Science
• J. Mach. Learn. Res.
• 2014
After $N$ iterations, with a constant step size proportional to $1/(R^2\sqrt{N})$, where $N$ is the number of observations and $R$ is the maximum norm of the observations, the convergence rate is always of order $O(1/\sqrt{N})$ and improves to $O(R^2/(\mu N))$ when the objective is locally strongly convex with constant $\mu$, which shows that averaged stochastic gradient descent is adaptive to unknown local strong convexity of the objective function.
SGD Learns Over-parameterized Networks that Provably Generalize on Linearly Separable Data
• Computer Science, Mathematics
• ICLR
• 2018
This work proves convergence rates of SGD to a global minimum and provides generalization guarantees for this global minimum that are independent of the network size, and shows that SGD can avoid overfitting despite the high capacity of the model.
Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n)
• Computer Science, Mathematics
• NIPS
• 2013
We consider the stochastic approximation problem where a convex function has to be minimized, given only the knowledge of unbiased estimates of its gradients at certain points, a framework which…
Risk and parameter convergence of logistic regression
• Computer Science, Mathematics
• ArXiv
• 2018
The ray does not pass through the origin in general, and its offset is the bounded global optimum of the risk over the remaining data; gradient descent recovers this offset at a rate $\mathcal{O}((\ln t)^2 / \sqrt{t})$.
Stochastic Convex Optimization
• Mathematics, Computer Science
• COLT
• 2009
Stochastic convex optimization is studied, and it is shown that the key ingredient is strong convexity and regularization, which is only a sufficient, but not necessary, condition for meaningful non-trivial learnability.
Learning Overparameterized Neural Networks via Stochastic Gradient Descent on Structured Data
• Computer Science, Mathematics
• NeurIPS
• 2018
It is proved that SGD learns a network with a small generalization error, albeit the network has enough capacity to fit arbitrary labels, when the data comes from mixtures of well-separated distributions.
On the Learning Dynamics of Deep Neural Networks
• Computer Science, Mathematics
• ArXiv
• 2018
This work studies the case of binary classification and proves various properties of learning in such networks under strong assumptions such as linear separability of the data, and confirms empirical observations by proving that the classification error also follows a sigmoidal shape in nonlinear architectures.