# Explicit regularization and implicit bias in deep network classifiers trained with the square loss

@article{Poggio2021ExplicitRA, title={Explicit regularization and implicit bias in deep network classifiers trained with the square loss}, author={Tomaso A. Poggio and Qianli Liao}, journal={ArXiv}, year={2021}, volume={abs/2101.00072} }

Deep ReLU networks trained with the square loss have been observed to perform well in classification tasks. We provide here a theoretical justification based on analysis of the associated gradient flow. We show that convergence to a solution with the absolute minimum norm is expected when normalization techniques such as Batch Normalization (BN) or Weight Normalization (WN) are used together with Weight Decay (WD). The main property of the minimizers that bounds their expected error is the norm…

## 11 Citations

### On the Optimization Landscape of Neural Collapse under MSE Loss: Global Optimality with Unconstrained Features

- Computer Science, ICML
- 2022

Under a simplified unconstrained feature model, this work provides the first global landscape analysis for vanilla nonconvex MSE loss and shows that the (only!) global minimizers are neural collapse solutions, while all other critical points are strict saddles whose Hessians exhibit negative curvature directions.

### Do We Really Need a Learnable Classifier at the End of Deep Neural Network?

- Computer Science, ArXiv
- 2022

The analytical work based on the layer-peeled model indicates that the feature learning with a fixed ETF classifier naturally leads to the neural collapse state even when the dataset is imbalanced among classes.

### Neural Collapse Under MSE Loss: Proximity to and Dynamics on the Central Path

- Computer Science, ArXiv
- 2021

The recently discovered Neural Collapse phenomenon occurs pervasively in today's deep net training paradigm of driving cross-entropy loss towards zero, and a new theoretical construct is introduced: the central path, where the linear classifier stays MSE-optimal for feature activations throughout the dynamics.

### Exploring deep neural networks via layer-peeled model: Minority collapse in imbalanced training

- Computer Science, Proceedings of the National Academy of Sciences
- 2021

The Layer-Peeled Model is introduced, a nonconvex, yet analytically tractable, optimization program that inherits many characteristics of well-trained neural networks, thereby offering an effective tool for explaining and predicting common empirical patterns of deep-learning training.

### Benign Overfitting in Multiclass Classification: All Roads Lead to Interpolation

- Computer Science, NeurIPS
- 2021

This analysis shows that good generalization is possible for SVM solutions beyond the realm in which typical margin-based bounds apply, and derives novel error bounds on the accuracy of the MNI classifier.

### On the Role of Neural Collapse in Transfer Learning

- Computer Science, ArXiv
- 2021

It is demonstrated both theoretically and empirically that neural collapse generalizes to new samples from the training classes, and – more importantly – to new classes as well, allowing foundation models to provide feature maps that work well in transfer learning and, specifically, in the few-shot setting.

### Memorization-Dilation: Modeling Neural Collapse Under Noise

- Computer Science
- 2022

This work proposes a more realistic variant of the unconstrained feature representation that takes the limited expressivity of the network into account and reveals why label smoothing, a modification of cross-entropy empirically observed to produce a regularization effect, leads to improved generalization in classification tasks.

### An Unconstrained Layer-Peeled Perspective on Neural Collapse

- Computer Science, ArXiv
- 2021

This paper proves that gradient flow on this model converges to critical points of a minimum-norm separation problem exhibiting neural collapse in its global minimizer, and shows that these results also hold during the training of neural networks in real-world tasks when explicit regularization or weight decay is not used.

### Limitations of Neural Collapse for Understanding Generalization in Deep Learning

- Psychology, ArXiv
- 2022

The recent work of Papyan, Han, and Donoho (2020) presented an intriguing “Neural Collapse” phenomenon, showing a structural property of interpolating classifiers in the late stage of training. This…

### Layer-Peeled Model: Toward Understanding Well-Trained Deep Neural Networks

- Computer Science, ArXiv
- 2021

It is proved that any solution to this model forms a simplex equiangular tight frame, which in part explains the recently discovered phenomenon of neural collapse in deep learning training [PHD20].
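As an illustration of the structure named in this summary, the following minimal sketch builds a simplex equiangular tight frame from the standard construction $\sqrt{K/(K-1)}\,(I - \tfrac{1}{K}\mathbf{1}\mathbf{1}^\top)$ and checks its two defining properties: unit-norm vectors and pairwise inner products of $-1/(K-1)$. The function names `simplex_etf` and `dot` and the choice of `K = 4` are assumptions made for this example; they do not come from the cited paper.

```python
import math

def simplex_etf(k):
    """Return the k vectors of sqrt(k/(k-1)) * (I - (1/k) * ones(k, k)).

    The matrix is symmetric, so its rows and columns are the same vectors.
    """
    scale = math.sqrt(k / (k - 1))
    return [
        [scale * ((1.0 if i == j else 0.0) - 1.0 / k) for j in range(k)]
        for i in range(k)
    ]

def dot(u, v):
    """Plain inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))

K = 4  # illustrative number of classes
vectors = simplex_etf(K)

# Every vector should have unit norm.
norms = [math.sqrt(dot(v, v)) for v in vectors]

# Every distinct pair should have inner product -1/(K-1).
pair_products = [
    dot(vectors[i], vectors[j]) for i in range(K) for j in range(i + 1, K)
]
```

With `K = 4`, each entry of `pair_products` should come out close to $-1/3$, the most negative common inner product that four unit vectors can share.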

## References

### Generalization in deep network classifiers trained with the square loss

- Computer Science
- 2020

This version has corrected derivations of Neural Collapse for the multiclass, square loss case, and shows that convergence to a solution with the absolute minimum norm is expected when normalization techniques such as Batch Normalization [4] (BN) or Weight Normalization [5] (WN) are used together with Weight Decay (WD) and small initialization of the weights.

### Implicit Regularization in ReLU Networks with the Square Loss

- Computer Science, COLT
- 2021

It is proved that even for a single ReLU neuron, it is impossible to characterize the implicit regularization with the square loss by any explicit function of the model parameters, and a more general framework than the one considered so far may be needed to understand implicit regularizations for nonlinear predictors.

### Theoretical issues in deep networks

- Computer Science, Proceedings of the National Academy of Sciences
- 2020

It is proved that for certain types of compositional functions, deep networks of the convolutional type (even without weight sharing) can avoid the curse of dimensionality.

### Theoretical Analysis of Auto Rate-Tuning by Batch Normalization

- Computer Science, ICLR
- 2019

It is shown that even if the authors fix the learning rate of scale-invariant parameters to a constant, gradient descent still approaches a stationary point at a rate of $T^{-1/2}$ in $T$ iterations, asymptotically matching the best bound for gradient descent with well-tuned learning rates.

### Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

- Computer Science, ICML
- 2015

Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin.

### Gradient Descent Maximizes the Margin of Homogeneous Neural Networks

- Computer Science, ICLR
- 2020

The implicit regularization of the gradient descent algorithm in homogeneous neural networks, including fully-connected and convolutional neural networks with ReLU or LeakyReLU activations, is studied, and it is proved that both the normalized margin and its smoothed version converge to the objective value at a KKT point of the optimization problem.

### The Implicit Bias of Depth: How Incremental Learning Drives Generalization

- Computer Science, ICLR
- 2020

The notion of incremental learning dynamics is defined and the conditions on depth and initialization for which this phenomenon arises in deep linear models are derived, proving that while shallow models can exhibit incremental learning dynamics, they require the initialization to be exponentially small for these dynamics to present themselves.

### Evaluation of Neural Architectures Trained with Square Loss vs Cross-Entropy in Classification Tasks

- Computer Science, ICLR
- 2021

It is argued that there is little compelling empirical or theoretical evidence indicating a clear-cut advantage to the cross-entropy loss, and that training using the square loss for classification needs to be a part of best practices of modern deep learning on equal footing with cross-entropy.

### Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks

- Computer Science, NIPS
- 2016

A reparameterization of the weight vectors in a neural network that decouples the length of those weight vectors from their direction is presented, improving the conditioning of the optimization problem and speeding up convergence of stochastic gradient descent.
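The decoupling described in this summary can be sketched in a few lines. The example assumes the usual weight-normalization form $w = g\,v/\lVert v\rVert$, where the scalar $g$ carries the length and $v$ only the direction; the helper name `weight_norm` and the sample values are illustrative, not taken from the paper's code.

```python
import math

def weight_norm(v, g):
    """Return w = g * v / ||v|| for a direction vector v and a length g."""
    norm = math.sqrt(sum(x * x for x in v))
    return [g * x / norm for x in v]

# Two direction parameters that differ only in scale (illustrative values).
w_small = weight_norm([3.0, 4.0], g=2.0)
w_large = weight_norm([30.0, 40.0], g=2.0)

# The length of the effective weight is set by g alone.
length = math.sqrt(sum(x * x for x in w_small))
```

Rescaling `v` leaves the effective weight unchanged, so `w_small` and `w_large` agree up to floating-point error and `length` is approximately `g`. This scale invariance is the property that lets the length and direction be treated separately.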

### Loss landscape: SGD can have a better view than GD

- Computer Science
- 2020

The main claim in this note is that while GD can converge to both types of critical points, SGD can only converge to the first kind, which include all global minima.