# On the Optimization of Deep Networks: Implicit Acceleration by Overparameterization

@article{Arora2018OnTO, title={On the Optimization of Deep Networks: Implicit Acceleration by Overparameterization}, author={Sanjeev Arora and Nadav Cohen and Elad Hazan}, journal={ArXiv}, year={2018}, volume={abs/1802.06509} }

Conventional wisdom in deep learning states that increasing depth improves expressiveness but complicates optimization. This paper suggests that, sometimes, increasing depth can speed up optimization. The effect of depth on optimization is decoupled from expressiveness by focusing on settings where additional layers amount to overparameterization - linear neural networks, a well-studied model. Theoretical analysis, as well as experiments, show that here depth acts as a preconditioner which may…
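The abstract's claim, that in linear networks extra depth only reparameterizes the model yet changes how gradient descent behaves, can be tried out on a toy problem. The sketch below is illustrative only and is not taken from this page: the ℓ4 regression loss, the scalar second layer, the initialization, the step size, and the helper names `train_direct` and `train_depth2` are all assumptions. It trains the same linear model twice, once with the weight vector `w` directly and once with `w = w1 * w2`, so the two loss curves can be compared.

```python
import numpy as np

def lp_loss(w, X, y, p=4):
    """Loss mean(|Xw - y|^p) / p (p = 4 is an assumed, illustrative choice)."""
    return np.mean(np.abs(X @ w - y) ** p) / p

def lp_grad(w, X, y, p=4):
    """Gradient of lp_loss with respect to the end-to-end weights w."""
    r = X @ w - y
    return X.T @ (np.sign(r) * np.abs(r) ** (p - 1)) / len(y)

def train_direct(X, y, lr=0.01, steps=300):
    """Plain gradient descent on w."""
    w = np.zeros(X.shape[1])
    losses = []
    for _ in range(steps):
        losses.append(lp_loss(w, X, y))
        w = w - lr * lp_grad(w, X, y)
    losses.append(lp_loss(w, X, y))
    return w, losses

def train_depth2(X, y, lr=0.01, steps=300, seed=0):
    """Gradient descent on (w1, w2), where the model's weights are w = w1 * w2.

    w1 is a vector and w2 a scalar, so the model is still linear; the
    extra "layer" adds no expressiveness. Initialization is an assumption.
    """
    rng = np.random.default_rng(seed)
    w1 = 0.01 * rng.standard_normal(X.shape[1])
    w2 = 1.0
    losses = []
    for _ in range(steps):
        w = w1 * w2
        losses.append(lp_loss(w, X, y))
        g = lp_grad(w, X, y)
        # Chain rule: dL/dw1 = w2 * g and dL/dw2 = <w1, g>.
        w1, w2 = w1 - lr * w2 * g, w2 - lr * (w1 @ g)
    losses.append(lp_loss(w1 * w2, X, y))
    return w1 * w2, losses

# Synthetic data (assumed setup): 200 samples, 5 features, unit-norm target.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
w_true = rng.standard_normal(5)
w_true /= np.linalg.norm(w_true)
y = X @ w_true + 0.1 * rng.standard_normal(200)

w_direct, direct_losses = train_direct(X, y)
w_deep, deep_losses = train_depth2(X, y)
print("direct  final loss:", direct_losses[-1])
print("depth-2 final loss:", deep_losses[-1])
```

Which curve falls faster depends on the loss, the initialization, and the step size, so the script prints both final losses rather than assuming a winner.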

## 281 Citations

Global Optimality Beyond Two Layers: Training Deep ReLU Networks via Convex Programs

- Computer Science
- ICML
- 2021

It is proved that, when the width of the network is fixed, the equivalent convex problem can be globally optimized by a standard convex optimization solver in time polynomial in the number of samples and the data dimension.

Optimization of Graph Neural Networks: Implicit Acceleration by Skip Connections and More Depth

- Computer Science
- ICML
- 2021

This work analyzes linearized GNNs and proves that despite the non-convexity of training, convergence to a global minimum at a linear rate is guaranteed under mild assumptions that are validated on real-world graphs.

Gradient Descent Provably Optimizes Over-parameterized Neural Networks

- Computer Science
- ICLR
- 2019

Over-parameterization and random initialization jointly restrict every weight vector to be close to its initialization for all iterations, which allows a strong convexity-like property to show that gradient descent converges at a global linear rate to the global optimum.

Implicit Under-Parameterization Inhibits Data-Efficient Deep Reinforcement Learning

- Computer Science
- ICLR
- 2021

An implicit under-parameterization phenomenon in value-based deep RL methods that use bootstrapping is identified: when value functions are trained with gradient descent using iterated regression onto target values generated by previous instances of the value network, more gradient updates decrease the expressivity of the current value network.

On the Convergence of Deep Networks with Sample Quadratic Overparameterization

- Computer Science
- ArXiv
- 2021

A tight finite-width Neural Tangent Kernel (NTK) equivalence is derived, suggesting that neural networks trained with this technique generalize at least as well as their NTK, and that the equivalence can also be used to study generalization.

Overparameterized Nonlinear Optimization with Applications to Neural Nets

- Computer Science
- 2019 13th International Conference on Sampling Theory and Applications (SampTA)
- 2019

This talk shows that the solution found by first-order methods, such as gradient descent, has nearly the shortest distance to the algorithm's initialization among all solutions, and argues that this shortest-distance property can be a good proxy for the simplest explanation.

More is Less: Inducing Sparsity via Overparameterization

- Computer Science
- ArXiv
- 2021

In order to reconstruct a vector from underdetermined linear measurements, it is shown that, if there exists an exact solution, vanilla gradient for the overparameterized loss functional converges to a good approximation of the solution of minimal ℓ1-norm.

FACTORIZED NEURAL LAYERS

- Computer Science
- 2021

Factorized layers (operations parameterized by products of two or more matrices) occur in a variety of deep learning contexts, including compressed model training, certain types of knowledge distillation, and multi-head self-attention architectures, and deep nets containing such layers are studied.

Finite-Sum Optimization: A New Perspective for Convergence to a Global Solution

- Computer Science
- ArXiv
- 2022

By using bounded style assumptions, this work proves convergence to an ε-(global) minimum using Õ(1/ε³) gradient computations and broadens the understanding of why and under what circumstances training of a DNN converges to a global minimum.

Implicit Regularization in Deep Matrix Factorization

- Computer Science
- NeurIPS
- 2019

This work studies the implicit regularization of gradient descent over deep linear neural networks for matrix completion and sensing, a model referred to as deep matrix factorization, and finds that adding depth to a matrix factorization enhances an implicit tendency towards low-rank solutions.

## References

SHOWING 1-10 OF 45 REFERENCES

On the Expressive Power of Deep Neural Networks

- Computer Science
- ICML
- 2017

We propose a new approach to the problem of neural network expressivity, which seeks to characterize how structural properties of a neural network family affect the functions it is able to compute.…

On the Quality of the Initial Basin in Overspecified Neural Networks

- Computer Science
- ICML
- 2016

This work studies the geometric structure of the associated non-convex objective function, in the context of ReLU networks and starting from a random initialization of the network parameters, and identifies some conditions under which it becomes more favorable to optimization.

Identity Matters in Deep Learning

- Computer Science
- ICLR
- 2017

This work gives a strikingly simple proof that arbitrarily deep linear residual networks have no spurious local optima and shows that residual networks with ReLU activations have universal finite-sample expressivity, in the sense that the network can represent any function of its sample provided that the model has more parameters than the sample size.

Global Optimality in Tensor Factorization, Deep Learning, and Beyond

- Computer Science
- ArXiv
- 2015

This framework derives sufficient conditions to guarantee that a local minimum of the non-convex optimization problem is a global minimum and shows that if the size of the factorized variables is large enough then from any initialization it is possible to find a global minimizer using a purely local descent algorithm.

Beating the Perils of Non-Convexity: Guaranteed Training of Neural Networks using Tensor Methods

- Computer Science
- 2017

This work proposes a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer based on tensor decomposition, which provably converges to the global optimum, under a set of mild non-degeneracy conditions.

Adaptive Subgradient Methods for Online Learning and Stochastic Optimization

- Computer Science
- J. Mach. Learn. Res.
- 2010

This work describes and analyzes an apparatus for adaptively modifying the proximal function, which significantly simplifies setting a learning rate and results in regret guarantees that are provably as good as the best proximal function that can be chosen in hindsight.

Qualitatively characterizing neural network optimization problems

- Computer Science
- ICLR
- 2015

A simple analysis technique is introduced to look for evidence that state-of-the-art neural networks are overcoming local optima, and finds that, on a straight path from initialization to solution, a variety of state-of-the-art neural networks never encounter any significant obstacles.

Analysis and Design of Convolutional Networks via Hierarchical Tensor Decompositions

- Computer Science
- ArXiv
- 2017

This paper overviews a series of works written by the authors, that through an equivalence to hierarchical tensor decompositions, analyze the expressive efficiency and inductive bias of various convolutional network architectural features (depth, width, strides and more).

No bad local minima: Data independent training error guarantees for multilayer neural networks

- Computer Science
- ArXiv
- 2016

It is proved that for an MNN with one hidden layer, the training error is zero at every differentiable local minimum, for almost every dataset and dropout-like noise realization, and the result is extended to the case of more than one hidden layer.

Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

- Computer Science
- ICML
- 2015

Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin.