# Feature Learning in L2-regularized DNNs: Attraction/Repulsion and Sparsity

```bibtex
@article{Jacot2022FeatureLI,
  title   = {Feature Learning in L2-regularized DNNs: Attraction/Repulsion and Sparsity},
  author  = {Arthur Jacot and Eugene Golikov and Cl{\'e}ment Hongler and Franck Gabriel},
  journal = {ArXiv},
  year    = {2022},
  volume  = {abs/2205.15809}
}
```

We study the loss surface of DNNs with L2 regularization. We show that the loss in terms of the parameters can be reformulated into a loss in terms of the layerwise activations Z_ℓ of the training set. This reformulation reveals the dynamics behind feature learning: each hidden representation Z_ℓ is optimal w.r.t. an attraction/repulsion problem and interpolates between the input and output representations, keeping as little information from the input as necessary to…
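As a rough sketch of the setup (standard notation, not the paper's exact formulation), the L2-regularized objective over parameters $\theta = (W_1, \dots, W_L)$ is

```latex
\mathcal{L}(\theta) \;=\; \frac{1}{N}\sum_{i=1}^{N} c\big(f_{\theta}(x_i),\, y_i\big)
\;+\; \lambda \sum_{\ell=1}^{L} \lVert W_{\ell} \rVert_F^2 ,
```

and the reformulation described in the abstract replaces the optimization over the weights $W_\ell$ by one over the hidden representations $Z_\ell$ of the training set.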

## 2 Citations

### Implicit Bias of Large Depth Networks: a Notion of Rank for Nonlinear Functions

- Mathematics, Computer Science · ArXiv
- 2022

It is shown that the representation cost of fully connected neural networks with homogeneous nonlinearities converges, as the depth of the network goes to infinity, to a notion of rank over nonlinear functions, and that autoencoders with optimal nonlinear rank are naturally denoising.

### A Better Way to Decay: Proximal Gradient Training Algorithms for Neural Nets

- Computer Science · ArXiv
- 2022

This paper argues that stochastic gradient descent (SGD) may be an inefficient algorithm for minimizing the weight-decay objective, and suggests a novel proximal gradient algorithm for network training.
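For an L2 penalty the proximal map has a closed-form shrinkage, which is the basic building block of any proximal gradient scheme. A minimal sketch of one such step on a scalar toy problem (this illustrates the general technique, not the paper's specific algorithm):

```python
def proximal_gradient_step(w, grad_loss, lr, weight_decay):
    """Forward step on the data loss, then the exact proximal map of the L2 penalty.

    prox of (weight_decay/2)*||.||^2 with step lr is a closed-form shrinkage:
        prox(u) = u / (1 + lr * weight_decay)
    """
    u = w - lr * grad_loss              # forward (gradient) step on the unregularized loss
    return u / (1 + lr * weight_decay)  # backward (proximal) step on the penalty

# toy problem: L(w) = 0.5 * (w - 3)^2 with weight decay 1.0, so grad L = w - 3
w = 0.0
for _ in range(500):
    w = proximal_gradient_step(w, w - 3.0, lr=0.1, weight_decay=1.0)
```

The iteration converges to 3 / (1 + weight_decay) = 1.5, the exact minimizer of 0.5*(w-3)^2 + 0.5*weight_decay*w^2, which is the point of the proximal step: the penalty is handled exactly rather than through a noisy gradient.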

## References

Showing 1-10 of 25 references

### Breaking the Curse of Dimensionality with Convex Neural Networks

- Computer Science · J. Mach. Learn. Res.
- 2017

This work considers neural networks with a single hidden layer and non-decreasing homogeneous activation functions, such as the rectified linear unit, and shows that they are adaptive to unknown underlying linear structures, such as the dependence on the projection of the input variables onto a low-dimensional subspace.

### Implicit Bias of Gradient Descent for Wide Two-layer Neural Networks Trained with the Logistic Loss

- Computer Science · COLT
- 2020

It is shown that the limits of the gradient flow on exponentially tailed losses can be fully characterized as a max-margin classifier in a certain non-Hilbertian space of functions.

### Feature Learning in Infinite-Width Neural Networks

- Computer Science · ArXiv
- 2020

It is shown that the standard and NTK parametrizations of a neural network do not admit infinite-width limits that can learn features, which is crucial for pretraining and transfer learning such as with BERT, and any such infinite-width limit can be computed using the Tensor Programs technique.

### Deep learning and the information bottleneck principle

- Computer Science · 2015 IEEE Information Theory Workshop (ITW)
- 2015

It is argued that the optimal architecture, i.e., the number of layers and the features/connections at each layer, is related to the bifurcation points of the information bottleneck tradeoff, namely the relevant compression of the input layer with respect to the output layer.

### Saddle-to-Saddle Dynamics in Deep Linear Networks: Small Initialization Training, Symmetry, and Sparsity

- Computer Science
- 2021

A Saddle-to-Saddle dynamics is conjectured: throughout training, gradient descent visits the neighborhoods of a sequence of saddles, each corresponding to linear maps of increasing rank, until reaching a sparse global minimum.

### Adam: A Method for Stochastic Optimization

- Computer Science · ICLR
- 2015

This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
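The moment estimates described here are straightforward to write down. A minimal sketch of the Adam update on a scalar toy problem (default hyperparameters from the paper; the toy objective is an illustration, not from the source):

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: exponential moving averages of the gradient and its
    square, with bias correction for the zero initialization of m and v."""
    m = beta1 * m + (1 - beta1) * grad       # first moment (mean of gradients)
    v = beta2 * v + (1 - beta2) * grad**2    # second moment (uncentered variance)
    m_hat = m / (1 - beta1**t)               # bias-corrected estimates
    v_hat = v / (1 - beta2**t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# minimize f(w) = w^2 (gradient 2w) starting from w = 1.0
w, m, v = 1.0, 0.0, 0.0
for t in range(1, 2001):
    w, m, v = adam_step(w, 2.0 * w, m, v, t, lr=0.01)
```

Because the update divides by the root of the second moment, the effective step size is roughly `lr` regardless of the gradient's scale, which is what makes the method robust across problems.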

### Towards Resolving the Implicit Bias of Gradient Descent for Matrix Factorization: Greedy Low-Rank Learning

- Computer Science · ICLR
- 2021

This work provides theoretical and empirical evidence that for depth-2 matrix factorization, gradient flow with infinitesimal initialization is mathematically equivalent to a simple heuristic rank minimization algorithm, Greedy Low-Rank Learning, under some reasonable assumptions.

### Representation Costs of Linear Neural Networks: Analysis and Design

- Mathematics · NeurIPS
- 2021

For different parameterizations (mappings from parameters to predictors), we study the regularization cost in predictor space induced by l2 regularization on the parameters (weights). We focus on…

### A Function Space View of Bounded Norm Infinite Width ReLU Nets: The Multivariate Case

- Computer Science · ICLR
- 2020

This paper characterizes the norm required to realize a function as a single-hidden-layer ReLU network with an unbounded number of units but bounded Euclidean norm of the weights, including a precise characterization of which functions can be realized with finite norm.

### A Note on Lazy Training in Supervised Differentiable Programming

- Computer Science · ArXiv
- 2018

In a simplified setting, it is proved that "lazy training" essentially solves a kernel regression, and it is shown that this behavior is not so much due to over-parameterization as to a choice of scaling, often implicit, that allows the model to be linearized around its initialization.
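The linearization referred to here is the first-order Taylor expansion of the model around its initialization $w_0$ (standard notation, not specific to this entry):

```latex
f_w(x) \;\approx\; f_{w_0}(x) \;+\; \nabla_w f_{w_0}(x)^{\top} (w - w_0) ,
```

so that in the lazy regime training reduces to regression with the fixed tangent kernel at initialization, $K(x, x') = \nabla_w f_{w_0}(x)^{\top} \nabla_w f_{w_0}(x')$.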