• Corpus ID: 246430285

# Implicit Regularization Towards Rank Minimization in ReLU Networks

```bibtex
@article{Timor2022ImplicitRT,
  author  = {Nadav Timor and Gal Vardi and Ohad Shamir},
  title   = {Implicit Regularization Towards Rank Minimization in ReLU Networks},
  journal = {ArXiv},
  year    = {2022},
  volume  = {abs/2201.12760}
}
```
• Published 30 January 2022 • Computer Science • ArXiv
We study the conjectured relationship between the implicit regularization in neural networks, trained with gradient-based methods, and rank minimization of their weight matrices. Previously, it was proved that for linear networks (of depth 2 and vector-valued outputs), gradient flow (GF) w.r.t. the square loss acts as a rank minimization heuristic. However, understanding to what extent this generalizes to nonlinear networks is an open problem. In this paper, we focus on nonlinear ReLU networks…
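To make the object of study concrete, here is a minimal numpy sketch (a hypothetical toy setup, not the paper's construction) that trains a depth-2 ReLU network with full-batch gradient descent on the square loss and then inspects the numerical rank of the first weight matrix:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data: targets generated by a rank-1 ReLU teacher (hypothetical setup).
n, d, h = 64, 10, 32
X = rng.normal(size=(n, d))
u = rng.normal(size=(d, 1))
y = np.maximum(X @ u, 0.0)

# Depth-2 ReLU network: f(x) = relu(x W1) w2, small random init.
W1 = 0.1 * rng.normal(size=(d, h))
w2 = 0.1 * rng.normal(size=(h, 1))

lr = 1e-2
for _ in range(2000):
    H = np.maximum(X @ W1, 0.0)        # hidden activations
    resid = H @ w2 - y                 # prediction error
    g_w2 = H.T @ resid / n             # gradient of the mean square loss w.r.t. w2
    g_H = resid @ w2.T                 # backprop into the hidden layer
    g_W1 = X.T @ (g_H * (H > 0)) / n   # ReLU mask gates the gradient
    W1 -= lr * g_W1
    w2 -= lr * g_w2

# Numerical rank of W1: singular values above a relative tolerance.
s = np.linalg.svd(W1, compute_uv=False)
num_rank = int((s > 1e-3 * s[0]).sum())
print(num_rank)
```

Whether and when such training drives `num_rank` down is exactly the kind of question the paper studies; the snippet only shows how the quantity is measured.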

## Citations

• Computer Science, ArXiv, 2022: This work investigates the implicit bias of gradient flow and gradient descent in two-layer fully-connected neural networks with leaky ReLU activations when the training data are nearly orthogonal, a common property of high-dimensional data.
• It is shown that the representation cost of fully connected neural networks with homogeneous nonlinearities converges, as the depth of the network goes to infinity, to a notion of rank over nonlinear functions, and that autoencoders with optimal nonlinear rank are naturally denoising.
• Computer Science, ArXiv, 2021: It is shown that in many cases the KKT point is not even a local optimum of the max-margin problem, and multiple settings where a local or global optimum can be guaranteed are identified.
• Computer Science, Research, 2023: The main property of the minimizers that bounds their expected error is ρ: among all close-to-interpolating solutions, the ones associated with smaller ρ are proved to have better margin and better bounds on the expected classification error.
• Computer Science, 2022: It is shown that convergence to a solution with the absolute minimum ρ is expected when normalization by a Lagrange multiplier is used together with weight decay, and it is proved that SGD converges to solutions biased towards 1) large margin (i.e., small ρ) and 2) low rank of the weight matrices.
• Computer Science, 2022: It is proved that SGD noise must always be present, even asymptotically, as long as weight decay is incorporated and the batch size is smaller than the total number of training samples.
• Computer Science, ArXiv, 2022: It is shown, both theoretically and empirically, that when training a neural network using SGD with weight decay and small batch size, the resulting weight matrices are expected to be of small rank.
• Computer Science, bioRxiv, 2022: It is found that a weight matrix built from only a few operative dimensions is sufficient for the RNNs to operate with the original performance, implying that much of the high-dimensional structure of the trained connectivity is functionally irrelevant.
• Computer Science, 2022 30th European Signal Processing Conference (EUSIPCO): Through a series of experiments, this paper studies and compares the performance of various low-rank matrix completion (LRMC) algorithms that were originally successful for data-independent sampling patterns, and considers various settings where the sampling mask depends on the underlying data values.
• Computer Science, 2022: It is demonstrated empirically that neural collapse extends beyond the penultimate layer and emerges in intermediate layers as well, and it is hypothesized and empirically shown that gradient-based methods are implicitly biased towards selecting neural networks of minimal depth for achieving this clustering property.
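Several of the citing works above quantify the rank of trained weight matrices. A common soft proxy is the stable rank, ||W||_F^2 / ||W||_2^2; a minimal self-contained sketch (the matrices here are synthetic, not taken from a trained network):

```python
import numpy as np

def stable_rank(W: np.ndarray) -> float:
    """Stable rank: squared Frobenius norm over squared spectral norm.

    Always lies in [1, rank(W)] and, unlike exact rank, is robust to
    small-but-nonzero singular values.
    """
    s = np.linalg.svd(W, compute_uv=False)
    return float((s ** 2).sum() / s[0] ** 2)

rng = np.random.default_rng(1)

# A generic random matrix has noticeably larger stable rank...
W_full = rng.normal(size=(50, 50))
# ...than a matrix that is exactly rank 2 (stable rank at most 2).
W_low = rng.normal(size=(50, 2)) @ rng.normal(size=(2, 50))

print(stable_rank(W_full), stable_rank(W_low))
```

The bound stable_rank(W) ≤ rank(W) follows because at most rank(W) singular values are nonzero and each squared singular value is at most the largest one.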

## References

Showing 1-10 of 45 references.

• Computer Science, COLT, 2021: It is proved that even for a single ReLU neuron it is impossible to characterize the implicit regularization with the square loss by any explicit function of the model parameters, suggesting that a more general framework than the one considered so far may be needed to understand implicit regularization for nonlinear predictors.
• Computer Science, ICLR, 2020: Stable rank normalization (SRN) is proposed: a novel, optimal, and computationally efficient weight-normalization scheme which minimizes the stable rank of a linear operator and can be shown to have a unique optimal solution.
• Computer Science, ICML, 2021: Motivated by tensor rank capturing the implicit regularization of a non-linear neural network, this work empirically explores it as a measure of complexity and finds that it captures the essence of datasets on which neural networks generalize, suggesting that tensor rank may pave the way to explaining both implicit regularization in deep learning and the properties of real-world data that translate this implicit regularization into generalization.
• Computer Science, ICLR, 2021: This work provides theoretical and empirical evidence that, for depth-2 matrix factorization, gradient flow with infinitesimal initialization is mathematically equivalent to a simple heuristic rank-minimization algorithm, Greedy Low-Rank Learning, under some reasonable assumptions.
• Computer Science, NeurIPS, 2020: The results suggest that, rather than perceiving the implicit regularization via norms, a potentially more useful interpretation is minimization of rank; it is demonstrated empirically that this interpretation extends to a certain class of non-linear neural networks, and hypothesized that it may be key to explaining generalization in deep learning.
• Computer Science, NeurIPS, 2019: This work studies the implicit regularization of gradient descent over deep linear neural networks for matrix completion and sensing, a model referred to as deep matrix factorization, and finds that adding depth to a matrix factorization enhances an implicit tendency towards low-rank solutions.
• Computer Science, ArXiv, 2021: It is shown that in many cases the KKT point is not even a local optimum of the max-margin problem, and multiple settings where a local or global optimum can be guaranteed are identified.
• Computer Science, Mathematics, ICLR, 2021: The implicit bias of gradient flow on linear neural network training is studied, and it is proved that gradient flow finds a global minimum which minimizes a norm-like function that interpolates between weighted $\ell_1$ and $\ell_2$ norms in the transformed input space.
• Computer Science, ICLR, 2019: This paper establishes risk convergence and asymptotic weight-matrix alignment, a form of implicit regularization, of gradient flow and gradient descent when applied to deep linear networks on linearly separable data.
• Computer Science, ICLR, 2020: The implicit regularization of the gradient descent algorithm in homogeneous neural networks, including fully-connected and convolutional neural networks with ReLU or LeakyReLU activations, is studied, and it is proved that both the normalized margin and its smoothed version converge to the objective value at a KKT point of the optimization problem.