Corpus ID: 220936062

Implicit Regularization in Deep Learning: A View from Function Space

@article{Baratin2020ImplicitRI,
  title={Implicit Regularization in Deep Learning: A View from Function Space},
  author={Aristide Baratin and Thomas George and C{\'e}sar Laurent and R. Devon Hjelm and Guillaume Lajoie and Pascal Vincent and Simon Lacoste-Julien},
  journal={ArXiv},
  year={2020},
  volume={abs/2008.00938}
}
We approach the problem of implicit regularization in deep learning from a geometrical viewpoint. We highlight a possible regularization effect induced by a dynamical alignment of the neural tangent features introduced by Jacot et al., along a small number of task-relevant directions. By extrapolating a new analysis of Rademacher complexity bounds in linear models, we propose and study a new heuristic complexity measure for neural networks which captures this phenomenon, in terms of sequences of…
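The central quantity here, the alignment of the neural tangent features with a few task-relevant directions, can be probed numerically. The following is a minimal sketch, not the authors' code: it assumes a toy MLP and synthetic data (all names and sizes are illustrative) and computes per-example tangent features, the empirical NTK Gram matrix, and its centered kernel alignment with the rank-one label kernel y y^T.

# Minimal illustrative sketch (assumed toy setup, not the paper's code):
# centered kernel alignment between the empirical NTK Gram matrix and the
# label kernel y y^T, a simple proxy for the tangent-feature alignment
# discussed in the abstract.
import jax
import jax.numpy as jnp

def init_mlp(key, sizes=(10, 64, 1)):
    params = []
    for d_in, d_out in zip(sizes[:-1], sizes[1:]):
        key, sub = jax.random.split(key)
        params.append((jax.random.normal(sub, (d_in, d_out)) / jnp.sqrt(d_in),
                       jnp.zeros(d_out)))
    return params

def mlp(params, x):
    for w, b in params[:-1]:
        x = jnp.tanh(x @ w + b)
    w, b = params[-1]
    return (x @ w + b).squeeze(-1)               # scalar output per example

def ntk_gram(params, xs):
    # Tangent features: gradient of the output w.r.t. every parameter, per example.
    grads = jax.vmap(jax.grad(mlp), in_axes=(None, 0))(params, xs)
    feats = jnp.concatenate([g.reshape(xs.shape[0], -1)
                             for g in jax.tree_util.tree_leaves(grads)], axis=1)
    return feats @ feats.T                       # K_ij = <grad f(x_i), grad f(x_j)>

def centered_alignment(K, y):
    # Centered kernel alignment between K and the rank-one target kernel y y^T.
    n = K.shape[0]
    H = jnp.eye(n) - jnp.ones((n, n)) / n
    Kc, Yc = H @ K @ H, H @ jnp.outer(y, y) @ H
    return jnp.sum(Kc * Yc) / (jnp.linalg.norm(Kc) * jnp.linalg.norm(Yc))

key = jax.random.PRNGKey(0)
xs = jax.random.normal(key, (32, 10))            # synthetic inputs
y = jnp.sign(xs[:, 0])                           # toy binary targets
print(centered_alignment(ntk_gram(init_mlp(key), xs), y))

Tracking this score over the course of training is one simple way to observe the dynamical alignment effect; the paper's actual complexity measure, built from Rademacher bounds over sequences of tangent kernel classes, is not reproduced here.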
Citations

Gradient Starvation: A Learning Proclivity in Neural Networks
TLDR
This work provides a theoretical explanation for the emergence of feature imbalance in neural networks and develops guarantees for a novel regularization method aimed at decoupling feature learning dynamics, improving accuracy and robustness in cases hindered by gradient starvation.

References

Showing 1-10 of 63 references
Geometry of Optimization and Implicit Regularization in Deep Learning
TLDR
This work argues that optimization plays a crucial role in the generalization of deep learning models through implicit regularization, and demonstrates how changing the empirical optimization procedure can improve generalization even when the actual optimization quality is unaffected.
Implicit Regularization of Discrete Gradient Dynamics in Deep Linear Neural Networks
TLDR
This work studies the discrete gradient dynamics of the training of a two-layer linear network with the least-squares loss, using a time rescaling to show that these dynamics sequentially learn the solutions of a reduced-rank regression with gradually increasing rank.
Weighted Optimization: better generalization by smoother interpolation
TLDR
It is argued through this model and numerical experiments that normalization methods in deep learning, such as weight normalization, improve generalization in overparameterized neural networks by implicitly encouraging smooth interpolants.
On the Inductive Bias of Neural Tangent Kernels
TLDR
This work studies smoothness, approximation, and stability properties of functions with finite norm, including stability to image deformations in the case of convolutional networks, and compares the NTK to other known kernels for similar architectures.
Characterizing Implicit Bias in Terms of Optimization Geometry
TLDR
This work explores whether the specific global minimum reached by an algorithm can be characterized in terms of the potential or norm of the optimization geometry, independently of hyperparameter choices such as step size and momentum.
Neural tangent kernel: convergence and generalization in neural networks (invited paper)
TLDR
This talk introduces the NTK formalism, gives a number of results on the Neural Tangent Kernel, and explains how they provide insight into the dynamics of neural networks during training and into their generalization properties.
In Search of the Real Inductive Bias: On the Role of Implicit Regularization in Deep Learning
TLDR
It is argued, partly by analogy to matrix factorization, that implicit regularization is an inductive bias that can help shed light on deep learning.
Neural Spectrum Alignment: Empirical Study
TLDR
This paper empirically explores properties of the NTK along the optimization trajectory and shows that in practical applications the NTK changes in a dramatic and meaningful way, with its top eigenfunctions aligning toward the target function learned by the network.
A Note on Lazy Training in Supervised Differentiable Programming
TLDR
In a simplified setting, it is proved that "lazy training" essentially solves a kernel regression, and it is shown that this behavior is due not so much to over-parameterization as to a choice of scaling, often implicit, that allows the model to be linearized around its initialization.
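To make the summary above concrete, here is a minimal sketch of the linearization behind lazy training, under an assumed toy two-layer model and random data (none of it from the paper): the network is replaced by its first-order Taylor expansion around initialization, and training that surrogate amounts to regression with the tangent kernel.

# Illustrative sketch (assumed toy model, not the paper's setup): the lazy /
# linearized surrogate f_lin(x; theta) = f(x; theta0) + J_theta f(x; theta0) . (theta - theta0).
import jax
import jax.numpy as jnp

def f(params, x):                                # toy two-layer scalar model
    w1, w2 = params
    return jnp.tanh(x @ w1) @ w2

def f_lin(params0, params, x):
    # First-order Taylor expansion of f around params0, evaluated at params.
    delta = jax.tree_util.tree_map(lambda p, p0: p - p0, params, params0)
    out0, jvp_out = jax.jvp(lambda p: f(p, x), (params0,), (delta,))
    return out0 + jvp_out

key = jax.random.PRNGKey(1)
k1, k2, k3 = jax.random.split(key, 3)
params0 = (jax.random.normal(k1, (10, 256)) / jnp.sqrt(10.0),
           jax.random.normal(k2, (256,)) / jnp.sqrt(256.0))
x = jax.random.normal(k3, (5, 10))
# At initialization the surrogate and the original model coincide exactly;
# the scaling argument concerns how far training moves away from this point.
print(jnp.allclose(f(params0, x), f_lin(params0, params0, x)))

Gradient descent on f_lin with respect to the parameters is a linear least-squares problem in the tangent features, i.e. kernel regression with the tangent kernel evaluated at initialization.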
Coherent Gradients: An Approach to Understanding Generalization in Gradient Descent-based Optimization
TLDR
This work proposes an approach to explaining why neural networks trained with gradient descent generalize well on real datasets, even though they are capable of fitting random data, based on a hypothesis about the dynamics of gradient descent called Coherent Gradients.