Implicit Regularization Towards Rank Minimization in ReLU Networks
@article{Timor2022ImplicitRT,
  title   = {Implicit Regularization Towards Rank Minimization in ReLU Networks},
  author  = {Nadav Timor and Gal Vardi and Ohad Shamir},
  journal = {ArXiv},
  year    = {2022},
  volume  = {abs/2201.12760}
}
We study the conjectured relationship between the implicit regularization in neural networks, trained with gradient-based methods, and rank minimization of their weight matrices. Previously, it was proved that for linear networks (of depth 2 and vector-valued outputs), gradient flow (GF) w.r.t. the square loss acts as a rank minimization heuristic. However, understanding to what extent this generalizes to nonlinear networks is an open problem. In this paper, we focus on nonlinear ReLU networks…
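As a rough illustration of the rank quantity in question (a minimal sketch under assumed data, architecture, and thresholds, not the paper's construction), one can train a small two-layer ReLU network with full-batch gradient descent on the square loss and count how many singular values of a weight matrix remain above a small cutoff:

```python
# Minimal sketch: full-batch gradient descent on the square loss for a small
# two-layer ReLU network, followed by a crude numerical-rank measurement of the
# first-layer weight matrix. All sizes, data, and thresholds are arbitrary choices.
import torch

torch.manual_seed(0)
X = torch.randn(64, 20)        # hypothetical inputs
y = torch.randn(64, 5)         # hypothetical vector-valued targets

model = torch.nn.Sequential(
    torch.nn.Linear(20, 50, bias=False),
    torch.nn.ReLU(),
    torch.nn.Linear(50, 5, bias=False),
)
opt = torch.optim.SGD(model.parameters(), lr=1e-2)

for _ in range(5000):          # full-batch GD as a crude stand-in for gradient flow
    opt.zero_grad()
    torch.nn.functional.mse_loss(model(X), y).backward()
    opt.step()

svals = torch.linalg.svdvals(model[0].weight.detach())
numerical_rank = int((svals > 1e-3 * svals.max()).sum())   # cutoff is an arbitrary choice
print(f"numerical rank of the first-layer weights: {numerical_rank} / {min(model[0].weight.shape)}")
```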
15 Citations
Implicit Bias in Leaky ReLU Networks Trained on High-Dimensional Data
- Computer Science · ArXiv
- 2022
This work investigates the implicit bias of gradient flow and gradient descent in two-layer fully-connected neural networks with leaky ReLU activations when the training data are nearly-orthogonal, a common property of high-dimensional data.
Implicit Bias of Large Depth Networks: a Notion of Rank for Nonlinear Functions
- Mathematics, Computer Science · ArXiv
- 2022
It is shown that the representation cost of fully connected neural networks with homogeneous nonlinearities converges as the depth of the network goes to infinity to a notion of rank over nonlinear functions and that autoencoders with optimal nonlinear rank are naturally denoising.
On Margin Maximization in Linear and ReLU Networks
- Computer Science · ArXiv
- 2021
It is shown that in many cases, the KKT point is not even a local optimum of the max margin problem, and multiple settings where a local or global optimum can be guaranteed are identified.
Dynamics in Deep Classifiers trained with the Square Loss: normalization, low rank, neural collapse and generalization bounds
- Computer Science · Research
- 2023
The quantity ρ is identified as the main property of the minimizers that bounds their expected error: it is proved that among all close-to-interpolating solutions, those associated with smaller ρ have better margin and better bounds on the expected classification error.
Deep Classifiers trained with the Square Loss
- Computer Science
- 2022
It is shown that convergence to a solution with the absolute minimum ρ is expected when normalization by a Lagrange multiplier is used together with weight decay, and it is proved that SGD converges to solutions biased towards 1) large margin (i.e., small ρ) and 2) low rank of the weight matrices.
SGD and Weight Decay Provably Induce a Low-Rank Bias in Neural Networks
- Computer Science
- 2022
It is proved that SGD noise must always be present, even asymptotically, as long as weight decay is used and the batch size is smaller than the total number of training samples.
SGD Noise and Implicit Low-Rank Bias in Deep Neural Networks
- Computer Science · ArXiv
- 2022
It is shown, both theoretically and empirically, that when training a neural network using SGD with weight decay and small batch size, the resulting weight matrices are expected to be of small rank.
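As a rough, hedged illustration of this claim (not the paper's experiments), one can train a small ReLU network with mini-batch SGD, once with and once without weight decay, and compare how quickly the normalized singular values of a hidden weight matrix decay; the data, sizes, and weight-decay coefficient below are arbitrary:

```python
# Sketch: mini-batch SGD with vs. without weight decay, comparing the decay of the
# normalized singular values of the first-layer weight matrix. Hyperparameters and
# data are placeholders, not taken from the paper.
import torch

def train(weight_decay, steps=3000, batch=8):
    torch.manual_seed(0)
    X, y = torch.randn(256, 20), torch.randn(256, 1)
    model = torch.nn.Sequential(
        torch.nn.Linear(20, 100), torch.nn.ReLU(), torch.nn.Linear(100, 1)
    )
    opt = torch.optim.SGD(model.parameters(), lr=5e-2, weight_decay=weight_decay)
    for _ in range(steps):
        idx = torch.randint(0, 256, (batch,))      # small batches supply the SGD noise
        opt.zero_grad()
        torch.nn.functional.mse_loss(model(X[idx]), y[idx]).backward()
        opt.step()
    s = torch.linalg.svdvals(model[0].weight.detach())
    return s / s.max()                             # normalized singular-value profile

print("with weight decay   :", train(5e-3)[:5])
print("without weight decay:", train(0.0)[:5])
```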
Operative dimensions in unconstrained connectivity of recurrent neural networks
- Computer Science · bioRxiv
- 2022
It is found that a weight matrix built from only a few operative dimensions is sufficient for the RNNs to operate with the original performance, implying that much of the high-dimensional structure of the trained connectivity is functionally irrelevant.
Truncated Matrix Completion - An Empirical Study
- Computer Science · 2022 30th European Signal Processing Conference (EUSIPCO)
- 2022
Through a series of experiments, this paper studies and compares the performance of various low-rank matrix completion (LRMC) algorithms that were originally successful for data-independent sampling patterns, and considers various settings where the sampling mask depends on the underlying data values.
On the Implicit Bias Towards Minimal Depth of Deep Neural Networks
- Computer Science
- 2022
It is demonstrated empirically that neural collapse extends beyond the penultimate layer and emerges in intermediate layers as well, and it is hypothesized and empirically shown that gradient-based methods are implicitly biased towards selecting neural networks of minimal depth that achieve this clustering property.
References
Showing 1-10 of 45 references
Implicit Regularization in ReLU Networks with the Square Loss
- Computer Science · COLT
- 2021
It is proved that, even for a single ReLU neuron, it is impossible to characterize the implicit regularization with the square loss by any explicit function of the model parameters, suggesting that a more general framework than the one considered so far may be needed to understand implicit regularization for nonlinear predictors.
Stable Rank Normalization for Improved Generalization in Neural Networks and GANs
- Computer Science · ICLR
- 2020
Stable rank normalization (SRN) is proposed, a novel, optimal, and computationally efficient weight-normalization scheme which minimizes the stable rank of a linear operator and can be shown to have a unique optimal solution.
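For reference, the stable rank itself is a standard quantity, srank(W) = ||W||_F^2 / ||W||_2^2; the short sketch below computes it (the SRN scheme itself is not reproduced here):

```python
# Stable rank: ||W||_F^2 / ||W||_2^2, i.e. the sum of squared singular values divided
# by the largest squared singular value. It always lies between 1 and rank(W).
import torch

def stable_rank(W: torch.Tensor) -> float:
    s = torch.linalg.svdvals(W)      # singular values in descending order
    return float((s ** 2).sum() / s[0] ** 2)

print(stable_rank(torch.randn(100, 50)))   # well above 1 for a random Gaussian matrix
print(stable_rank(torch.ones(100, 50)))    # exactly 1 for a rank-one matrix
```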
Implicit Regularization in Tensor Factorization
- Computer Science · ICML
- 2021
Motivated by the possibility that tensor rank captures the implicit regularization of a non-linear neural network, this work empirically explores tensor rank as a measure of complexity and finds that it captures the essence of datasets on which neural networks generalize, suggesting that tensor rank may pave the way to explaining both implicit regularization in deep learning and the properties of real-world data that translate this implicit regularization into generalization.
Towards Resolving the Implicit Bias of Gradient Descent for Matrix Factorization: Greedy Low-Rank Learning
- Computer Science · ICLR
- 2021
This work provides theoretical and empirical evidence that for depth-2 matrix factorization, gradient flow with infinitesimal initialization is mathematically equivalent to a simple heuristic rank minimization algorithm, Greedy Low-Rank Learning, under some reasonable assumptions.
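A toy sketch of the phenomenon described here (a depth-2 factorization trained from a very small initialization tends to fit the target rank by rank), with all sizes, learning rate, and thresholds chosen arbitrarily; this is not the GLRL algorithm itself:

```python
# Sketch: gradient descent on a depth-2 factorization W2 @ W1 of a rank-3 target,
# starting from a tiny initialization. The effective rank of the product typically
# climbs towards the target rank in stages, though the exact timing depends on the setup.
import torch

torch.manual_seed(0)
d, r = 30, 3
M = torch.randn(d, r) @ torch.randn(r, d) / d          # rank-3 target, rescaled for stability

scale = 1e-4                                            # stand-in for "infinitesimal" initialization
W1 = (scale * torch.randn(d, d)).requires_grad_()
W2 = (scale * torch.randn(d, d)).requires_grad_()
opt = torch.optim.SGD([W1, W2], lr=0.1)
tol = 1e-2 * torch.linalg.svdvals(M)[0]                 # cutoff relative to the target's top singular value

for step in range(601):
    opt.zero_grad()
    ((W2 @ W1 - M) ** 2).sum().backward()
    opt.step()
    if step % 50 == 0:
        s = torch.linalg.svdvals((W2 @ W1).detach())
        print(f"step {step:3d}  effective rank of W2 @ W1 = {int((s > tol).sum())}")
```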
Implicit Regularization in Deep Learning May Not Be Explainable by Norms
- Computer Science · NeurIPS
- 2020
The results suggest that, rather than interpreting the implicit regularization via norms, a potentially more useful interpretation is minimization of rank; it is demonstrated empirically that this interpretation extends to a certain class of non-linear neural networks, and it is hypothesized that rank minimization may be key to explaining generalization in deep learning.
Implicit Regularization in Deep Matrix Factorization
- Computer Science · NeurIPS
- 2019
This work studies the implicit regularization of gradient descent over deep linear neural networks for matrix completion and sensing, a model referred to as deep matrix factorization, and finds that adding depth to a matrix factorization enhances an implicit tendency towards low-rank solutions.
On Margin Maximization in Linear and ReLU Networks
- Computer Science · ArXiv
- 2021
It is shown that in many cases, the KKT point is not even a local optimum of the max margin problem, and multiple settings where a local or global optimum can be guaranteed are identified.
A Unifying View on Implicit Bias in Training Linear Neural Networks
- Computer Science, Mathematics · ICLR
- 2021
The implicit bias of gradient flow is studied on linear neural network training, and it is proved that gradient flow finds a global minimum which minimizes a norm-like function that interpolates between weighted $\ell_1$ and $\ell_2$ norms in the transformed input space.
Gradient descent aligns the layers of deep linear networks
- Computer Science · ICLR
- 2019
This paper establishes risk convergence and asymptotic weight matrix alignment --- a form of implicit regularization --- of gradient flow and gradient descent when applied to deep linear networks on…
Gradient Descent Maximizes the Margin of Homogeneous Neural Networks
- Computer Science · ICLR
- 2020
The implicit regularization of the gradient descent algorithm in homogeneous neural networks, including fully-connected and convolutional neural networks with ReLU or LeakyReLU activations, is studied, and it is proved that both the normalized margin and its smoothed version converge to the objective value at a KKT point of an associated margin-maximization problem.
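As a small, hedged sketch of the normalized-margin quantity studied in this line of work, consider a bias-free two-layer ReLU network, which is 2-homogeneous in its parameters; the data and network below are placeholders, not the paper's setting:

```python
# Sketch: normalized margin min_i y_i f(x_i) / ||theta||^L for an L-homogeneous network
# (here L = 2 for a bias-free two-layer ReLU net) with binary labels y_i in {-1, +1}.
import torch

def normalized_margin(model, X, y, order=2):
    margins = y * model(X).squeeze(-1)
    theta_norm = torch.sqrt(sum((p ** 2).sum() for p in model.parameters()))
    return (margins.min() / theta_norm ** order).item()

torch.manual_seed(0)
X = torch.randn(32, 10)                    # hypothetical inputs
y = torch.sign(torch.randn(32))            # hypothetical binary labels
model = torch.nn.Sequential(
    torch.nn.Linear(10, 40, bias=False),
    torch.nn.ReLU(),
    torch.nn.Linear(40, 1, bias=False),
)
print(normalized_margin(model, X, y))      # typically negative at initialization; the cited
                                           # result concerns its behavior during late training
```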