Corpus ID: 246430285

Implicit Regularization Towards Rank Minimization in ReLU Networks

@article{Timor2022ImplicitRT,
  title={Implicit Regularization Towards Rank Minimization in ReLU Networks},
  author={Nadav Timor and Gal Vardi and Ohad Shamir},
  journal={ArXiv},
  year={2022},
  volume={abs/2201.12760}
}
We study the conjectured relationship between the implicit regularization in neural networks trained with gradient-based methods and rank minimization of their weight matrices. Previously, it was proved that for linear networks (of depth 2 and with vector-valued outputs), gradient flow (GF) w.r.t. the square loss acts as a rank-minimization heuristic. However, understanding to what extent this generalizes to nonlinear networks is an open problem. In this paper, we focus on nonlinear ReLU networks… 
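To make "rank minimization of weight matrices" concrete, the sketch below trains a depth-2 ReLU network with full-batch gradient descent on the square loss and reports the singular values and numerical rank of the hidden-layer weight matrix. It is a minimal illustration under assumed sizes, synthetic data, initialization scale, and a heuristic 1e-3 rank threshold, not the paper's construction.

```python
# Minimal sketch, assuming a random teacher, small initialization, and a fixed
# relative rank threshold; none of this reproduces the paper's setting.
import torch

torch.manual_seed(0)
n, d, h, k = 256, 20, 100, 3                 # samples, input dim, hidden width, outputs
X = torch.randn(n, d)
Y = torch.relu(X @ torch.randn(d, k))        # targets from a simple random teacher

W1 = (1e-3 * torch.randn(h, d)).requires_grad_()
b1 = torch.zeros(h, requires_grad=True)
W2 = (1e-3 * torch.randn(k, h)).requires_grad_()

opt = torch.optim.SGD([W1, b1, W2], lr=0.05)
for _ in range(5000):
    pred = torch.relu(X @ W1.T + b1) @ W2.T  # depth-2 ReLU network
    loss = ((pred - Y) ** 2).mean()          # square loss
    opt.zero_grad()
    loss.backward()
    opt.step()

# "Rank minimization" here refers to the spectrum of the trained weight matrix:
# count singular values above a (heuristic) relative threshold.
svals = torch.linalg.svdvals(W1.detach())
print("top singular values of W1:", svals[:10])
print("numerical rank of W1:", int((svals > 1e-3 * svals[0]).sum()))
```

How sharply the spectrum decays depends on the initialization scale, the data, and the training regime; the snippet only shows how an (approximate) rank of a weight matrix can be measured.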

Citations

Implicit Bias in Leaky ReLU Networks Trained on High-Dimensional Data

This work investigates the implicit bias of gradient flow and gradient descent in two-layer fully-connected neural networks with leaky ReLU activations when the training data are nearly orthogonal, a common property of high-dimensional data.

Implicit Bias of Large Depth Networks: a Notion of Rank for Nonlinear Functions

It is shown that, as the depth of the network goes to infinity, the representation cost of fully connected neural networks with homogeneous nonlinearities converges to a notion of rank over nonlinear functions, and that autoencoders with optimal nonlinear rank are naturally denoising.

On Margin Maximization in Linear and ReLU Networks

It is shown that in many cases, the KKT point is not even a local optimum of the max margin problem, and multiple settings where a local or global optimum can be guaranteed are identified.

Dynamics in Deep Classifiers trained with the Square Loss: normalization, low rank, neural collapse and generalization bounds

The key property of the minimizers that bounds their expected error is ρ: it is proved that, among all close-to-interpolating solutions, the ones associated with smaller ρ have better margin and better bounds on the expected classification error.

Deep Classifiers trained with the Square Loss

It is shown that convergence to a solution with the absolute minimum ρ is expected when normalization by a Lagrange multiplier is used together with weight decay, and it is proved that SGD converges to solutions that are biased towards (1) large margin (i.e., small ρ) and (2) low rank of the weight matrices.

SGD and Weight Decay Provably Induce a Low-Rank Bias in Neural Networks

It is proved that SGD noise must always be present, even asymptotically, as long as weight decay is used and the batch size is smaller than the total number of training samples.

SGD Noise and Implicit Low-Rank Bias in Deep Neural Networks

It is shown, both theoretically and empirically, that when training a neural network using SGD with weight decay and small batch size, the resulting weight matrices are expected to be of small rank.

Operative dimensions in unconstrained connectivity of recurrent neural networks

It is found that a weight matrix built from only a few operative dimensions is sufficient for the RNNs to operate with the original performance, implying that much of the high-dimensional structure of the trained connectivity is functionally irrelevant.

Truncated Matrix Completion - An Empirical Study

Through a series of experiments, this paper studies and compares the performance of several low-rank matrix completion (LRMC) algorithms that were originally successful for data-independent sampling patterns, and considers settings where the sampling mask depends on the underlying data values.

On the Implicit Bias Towards Minimal Depth of Deep Neural Networks

It is demonstrated empirically that neural collapse extends beyond the penultimate layer and emerges in intermediate layers as well; it is further hypothesized, and shown empirically, that gradient-based methods are implicitly biased towards selecting neural networks of minimal depth for achieving this clustering property.

References

Showing 1-10 of 45 references

Implicit Regularization in ReLU Networks with the Square Loss

It is proved that even for a single ReLU neuron, the implicit regularization with the square loss cannot be characterized by any explicit function of the model parameters, suggesting that a more general framework than the one considered so far may be needed to understand implicit regularization for nonlinear predictors.

Stable Rank Normalization for Improved Generalization in Neural Networks and GANs

Stable rank normalization (SRN) is proposed: a novel, optimal, and computationally efficient weight-normalization scheme that minimizes the stable rank of a linear operator and is shown to have a unique optimal solution.
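For context, the stable rank referenced here is commonly defined as srank(W) = ||W||_F^2 / ||W||_2^2, the squared Frobenius norm over the squared spectral norm; it is at most the rank and is insensitive to tiny singular values. The following is a small, self-contained sketch of the quantity itself, not the SRN normalization scheme; the matrix sizes and singular-value profile are illustrative assumptions.

```python
import numpy as np

def stable_rank(W: np.ndarray) -> float:
    """srank(W) = ||W||_F^2 / ||W||_2^2, computed from the singular values."""
    svals = np.linalg.svd(W, compute_uv=False)   # descending order
    return float((svals ** 2).sum() / svals[0] ** 2)

# A matrix with 5 large and 45 small singular values: numerically full rank,
# yet its stable rank reflects only the 5 dominant directions.
rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.standard_normal((100, 50)))  # orthonormal columns
V, _ = np.linalg.qr(rng.standard_normal((50, 50)))
s = np.array([10.0] * 5 + [0.1] * 45)
W = U @ np.diag(s) @ V.T
print("rank:", np.linalg.matrix_rank(W))         # 50
print("stable rank:", round(stable_rank(W), 2))  # ~5.0
```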

Implicit Regularization in Tensor Factorization

Motivated by tensor rank capturing the implicit regularization of a non-linear neural network, this work empirically explores tensor rank as a measure of complexity and finds that it captures the essence of datasets on which neural networks generalize, suggesting that tensor rank may pave the way to explaining both implicit regularization in deep learning and the properties of real-world data that translate this implicit regularization into generalization.

Towards Resolving the Implicit Bias of Gradient Descent for Matrix Factorization: Greedy Low-Rank Learning

This work provides theoretical and empirical evidence that for depth-2 matrix factorization, gradient flow with infinitesimal initialization is mathematically equivalent to a simple heuristic rank minimization algorithm, Greedy Low-Rank Learning, under some reasonable assumptions.

Implicit Regularization in Deep Learning May Not Be Explainable by Norms

The results suggest that, rather than perceiving the implicit regularization via norms, a potentially more useful interpretation is minimization of rank; it is demonstrated empirically that this interpretation extends to a certain class of non-linear neural networks, and it is hypothesized that it may be key to explaining generalization in deep learning.

Implicit Regularization in Deep Matrix Factorization

This work studies the implicit regularization of gradient descent over deep linear neural networks for matrix completion and sensing, a model referred to as deep matrix factorization, and finds that adding depth to a matrix factorization enhances an implicit tendency towards low-rank solutions.
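As a concrete illustration of deep matrix factorization, the sketch below fits a product of square factors to the observed entries of a low-rank matrix using gradient descent from a small initialization and then inspects the spectrum of the product. The sizes, masking ratio, learning rate, and iteration count are illustrative assumptions, not the cited paper's experimental setup.

```python
# Minimal sketch under assumed hyperparameters; increasing `depth` typically
# sharpens the decay of the recovered spectrum, i.e. a stronger low-rank bias.
import torch

torch.manual_seed(0)
n, true_rank, depth = 30, 2, 3
target = torch.randn(n, true_rank) @ torch.randn(true_rank, n)  # rank-2 ground truth
mask = (torch.rand(n, n) < 0.5).float()                         # observe ~half the entries

factors = [(0.1 * torch.randn(n, n)).requires_grad_() for _ in range(depth)]
opt = torch.optim.SGD(factors, lr=0.5)
for _ in range(5000):
    product = factors[0]
    for Wi in factors[1:]:
        product = product @ Wi
    loss = (mask * (product - target) ** 2).sum() / mask.sum()  # observed entries only
    opt.zero_grad()
    loss.backward()
    opt.step()

svals = torch.linalg.svdvals(product.detach())
print("top singular values of the recovered matrix:", svals[:5])
```

Re-running with a different `depth` at the same initialization scale gives a rough sense of how depth affects the decay of the recovered spectrum.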

On Margin Maximization in Linear and ReLU Networks

It is shown that in many cases, the KKT point is not even a local optimum of the max margin problem, and multiple settings where a local or global optimum can be guaranteed are identified.

A Unifying View on Implicit Bias in Training Linear Neural Networks

The implicit bias of gradient flow is studied in linear neural network training, and it is proved that gradient flow finds a global minimum which minimizes a norm-like function that interpolates between weighted $\ell_1$ and $\ell_2$ norms in the transformed input space.

Gradient descent aligns the layers of deep linear networks

This paper establishes risk convergence and asymptotic weight matrix alignment (a form of implicit regularization) of gradient flow and gradient descent when applied to deep linear networks on linearly separable data.

Gradient Descent Maximizes the Margin of Homogeneous Neural Networks

The implicit regularization of gradient descent in homogeneous neural networks, including fully-connected and convolutional neural networks with ReLU or LeakyReLU activations, is studied, and it is proved that both the normalized margin and its smoothed version converge to the objective value at a KKT point of the margin maximization problem.