Feature Learning in L2-regularized DNNs: Attraction/Repulsion and Sparsity

Arthur Jacot, Eugene Golikov, Clément Hongler, Franck Gabriel
We study the loss surface of DNNs with L2 regularization. We show that the loss in terms of the parameters can be reformulated into a loss in terms of the layerwise activations Z_ℓ of the training set. This reformulation reveals the dynamics behind feature learning: each hidden representation Z_ℓ is optimal w.r.t. an attraction/repulsion problem and interpolates between the input and output representations, keeping as little information from the input as necessary to…

Implicit Bias of Large Depth Networks: a Notion of Rank for Nonlinear Functions

It is shown that the representation cost of fully connected neural networks with homogeneous nonlinearities converges as the depth of the network goes to infinity to a notion of rank over nonlinear functions and that autoencoders with optimal nonlinear rank are naturally denoising.

A Better Way to Decay: Proximal Gradient Training Algorithms for Neural Nets

This paper argues that stochastic gradient descent (SGD) may be an inefficient algorithm for optimizing the weight-decay objective, and suggests a novel proximal gradient algorithm for network training.
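
To make the term "proximal gradient" concrete, here is a minimal sketch of a generic proximal gradient step for a squared-L2 penalty (lam/2)*||w||^2. It is not the algorithm proposed in the paper above, whose details are not given in this summary. The function names, the toy loss, and the values of `step` and `lam` are all assumptions chosen for illustration.

```python
def prox_l2_squared(v, step, lam):
    """Proximal operator of (lam/2)*||w||^2 with step size `step`.

    The minimizer of (lam/2)*||w||^2 + ||w - v||^2 / (2*step)
    has the closed form v / (1 + step*lam).
    """
    scale = 1.0 / (1.0 + step * lam)
    return [scale * vi for vi in v]


def proximal_gradient_step(w, grad_loss, step, lam):
    """One iteration: gradient step on the data loss, then the prox of the penalty."""
    g = grad_loss(w)
    v = [wi - step * gi for wi, gi in zip(w, g)]
    return prox_l2_squared(v, step, lam)


# Toy data loss (hypothetical): 0.5 * (w - 2)^2, with gradient (w - 2).
def toy_grad(w):
    return [w[0] - 2.0]


w = [0.0]
for _ in range(200):
    w = proximal_gradient_step(w, toy_grad, step=0.1, lam=1.0)
# The regularized optimum of 0.5*(w-2)^2 + 0.5*lam*w^2 is 2 / (1 + lam).
```

Plain gradient descent would instead fold the penalty gradient `lam * w` into the gradient step; the proximal variant handles the penalty exactly through its closed-form shrinkage.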



Breaking the Curse of Dimensionality with Convex Neural Networks

F. Bach. Computer Science. J. Mach. Learn. Res., 2017.
This work considers neural networks with a single hidden layer and non-decreasing homogeneous activation functions like the rectified linear units and shows that they are adaptive to unknown underlying linear structures, such as the dependence on the projection of the input variables onto a low-dimensional subspace.

Implicit Bias of Gradient Descent for Wide Two-layer Neural Networks Trained with the Logistic Loss

It is shown that the limits of the gradient flow on exponentially tailed losses can be fully characterized as a max-margin classifier in a certain non-Hilbertian space of functions.

Feature Learning in Infinite-Width Neural Networks

It is shown that the standard and NTK parametrizations of a neural network do not admit infinite-width limits that can learn features, even though feature learning is crucial for pretraining and transfer learning (as with BERT), and that any such infinite-width limit can be computed using the Tensor Programs technique.

Deep learning and the information bottleneck principle

It is argued that the optimal architecture, namely the number of layers and the features/connections at each layer, is related to the bifurcation points of the information bottleneck tradeoff, that is, relevant compression of the input layer with respect to the output layer.

Saddle-to-Saddle Dynamics in Deep Linear Networks: Small Initialization Training, Symmetry, and Sparsity

A Saddle-to-Saddle dynamics is conjectured: throughout training, gradient descent visits the neighborhoods of a sequence of saddles, each corresponding to linear maps of increasing rank, until reaching a sparse global minimum.

Adam: A Method for Stochastic Optimization

This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
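
The update described above can be sketched in a few lines of plain Python. This is a minimal illustration of the standard Adam rule (exponential moving averages of the gradient and its square, with bias correction), not a reference implementation. The function name `adam_minimize`, the toy quadratic, and the step count are assumptions made for the example; the default hyperparameters are the commonly used ones.

```python
import math


def adam_minimize(grad, theta, steps=2000, lr=0.01,
                  beta1=0.9, beta2=0.999, eps=1e-8):
    """Run Adam on a list of floats, given grad(theta) -> list of floats."""
    theta = list(theta)
    m = [0.0] * len(theta)  # moving average of gradients (first moment)
    v = [0.0] * len(theta)  # moving average of squared gradients (second moment)
    for t in range(1, steps + 1):
        g = grad(theta)
        for i, gi in enumerate(g):
            m[i] = beta1 * m[i] + (1 - beta1) * gi
            v[i] = beta2 * v[i] + (1 - beta2) * gi * gi
            # Bias correction compensates for the zero initialization of m and v.
            m_hat = m[i] / (1 - beta1 ** t)
            v_hat = v[i] / (1 - beta2 ** t)
            theta[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta


# Toy objective (hypothetical): (x - 3)^2 + (y + 1)^2, minimized at (3, -1).
def quad_grad(theta):
    return [2 * (theta[0] - 3.0), 2 * (theta[1] + 1.0)]


result = adam_minimize(quad_grad, [0.0, 0.0])
```

Because each coordinate's step is divided by the square root of its own second-moment estimate, the early steps have magnitude close to `lr` regardless of the gradient scale.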

Towards Resolving the Implicit Bias of Gradient Descent for Matrix Factorization: Greedy Low-Rank Learning

This work provides theoretical and empirical evidence that for depth-2 matrix factorization, gradient flow with infinitesimal initialization is mathematically equivalent to a simple heuristic rank minimization algorithm, Greedy Low-Rank Learning, under some reasonable assumptions.

Representation Costs of Linear Neural Networks: Analysis and Design

For different parameterizations (mappings from parameters to predictors), we study the regularization cost in predictor space induced by l2 regularization on the parameters (weights). We focus on

A Function Space View of Bounded Norm Infinite Width ReLU Nets: The Multivariate Case

This paper characterizes the norm required to realize a function as a single hidden-layer ReLU network with an unbounded number of units, but where the Euclidean norm of the weights is bounded, including precisely characterizing which functions can be realized with finite norm.

A Note on Lazy Training in Supervised Differentiable Programming

In a simplified setting, it is proved that "lazy training" essentially solves a kernel regression, and it is shown that this behavior is due not so much to over-parameterization as to a choice of scaling, often implicit, that allows the model to be linearized around its initialization.