# Global Optimality Beyond Two Layers: Training Deep ReLU Networks via Convex Programs

    @inproceedings{Ergen2021GlobalOB,
      title={Global Optimality Beyond Two Layers: Training Deep ReLU Networks via Convex Programs},
      author={Tolga Ergen and Mert Pilanci},
      booktitle={ICML},
      year={2021}
    }

Understanding the fundamental mechanism behind the success of deep neural networks is one of the key challenges in the modern machine learning literature. Despite numerous attempts, a solid theoretical analysis is yet to be developed. In this paper, we develop a novel unified framework to reveal a hidden regularization mechanism through the lens of convex optimization. We first show that the training of multiple three-layer ReLU sub-networks with weight decay regularization can be equivalently…
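A minimal sketch (not the paper's algorithm) of the structural fact underlying this line of convex reformulations: for fixed training data `X`, a ReLU unit can realize only finitely many on/off activation patterns, one per region of the hyperplane arrangement induced by the data. Enumerating these patterns is what turns the non-convex ReLU training problem into a finite-dimensional convex program. The toy data and sampling procedure below are illustrative assumptions.

```python
import numpy as np

# For fixed data X, a ReLU unit x -> max(x @ u, 0) induces an activation
# pattern 1[X @ u >= 0] that is constant on each region of the hyperplane
# arrangement {u : x_i @ u = 0}. We probe the arrangement with random
# directions and collect the distinct patterns.
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 2))  # toy data: 5 samples, 2 features

patterns = set()
for _ in range(10000):
    u = rng.standard_normal(2)
    patterns.add(tuple((X @ u >= 0).astype(int)))

# A classical arrangement count bounds the number of regions by
# 2 * (C(4,0) + C(4,1)) = 10 for 5 generic hyperplanes in R^2,
# far fewer than the 2^5 = 32 conceivable sign vectors.
print(len(patterns))
```

The polynomial (in the number of samples, for fixed data rank) growth of this pattern count is what makes the equivalent convex programs finite and, in some regimes, tractable.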

## 5 Citations

Path Regularization: A Convexity and Sparsity Inducing Regularization for Parallel ReLU Networks

- Computer Science, Mathematics · ArXiv
- 2021

Polynomial-time trainability of path-regularized ReLU networks with global optimality guarantees is proved, and the equivalent convex problem is regularized via a group-sparsity-inducing norm.

Efficient Global Optimization of Two-layer ReLU Networks: Quadratic-time Algorithms and Adversarial Training

- Computer Science
- 2022

This work characterizes the quality of this approximation and develops two efficient algorithms that train ANNs with global convergence guarantees by solving unconstrained convex formulations, converging to an approximately globally optimal classifier.

Vector-output ReLU Neural Network Problems are Copositive Programs: Convex Analysis of Two Layer Networks and Polynomial-time Algorithms

- Computer Science, Mathematics · ICLR
- 2021

This work demonstrates how neural networks implicitly attempt to solve copositive programs via semi-nonnegative matrix factorization, and describes the first algorithms for provably finding the global minimum of the vector output neural network training problem.

The Convex Geometry of Backpropagation: Neural Network Gradient Flows Converge to Extreme Points of the Dual Convex Program

- Computer Science, Mathematics · ArXiv
- 2021

It is shown that the limit points of non-convex subgradient flows can be identified via primal-dual correspondence in this convex optimization problem, and a sufficient condition on the dual variables is derived which ensures that the stationary points of the non-convex objective are the KKT points of an equivalent convex objective, thus proving convergence of non-convex gradient flows to the global optimum.

Neural networks with linear threshold activations: structure and algorithms

- Computer Science · ArXiv
- 2021

This article precisely characterizes the class of functions representable by such neural networks and shows that two hidden layers are necessary and sufficient to represent any function in the class, a surprising result in light of recent exact-representability investigations for neural networks using other popular activation functions.

## References

Showing 1–10 of 53 references

Convex Duality of Deep Neural Networks

- Computer Science · ArXiv
- 2020

It is shown that a set of optimal hidden layer weight matrices for a norm regularized deep neural network training problem can be explicitly found as the extreme points of a convex set.

Convex Geometry of Two-Layer ReLU Networks: Implicit Autoencoding and Interpretable Models

- Computer Science · AISTATS
- 2020

A convex analytic framework for ReLU neural networks is developed which elucidates the inner workings of hidden neurons and their function space characteristics and establishes a connection to $\ell_0$-$\ell_1$ equivalence for neural networks analogous to the minimal cardinality solutions in compressed sensing.

Convex Geometry and Duality of Over-parameterized Neural Networks

- Computer Science, Mathematics · ArXiv
- 2020

A convex analytic framework for ReLU neural networks is developed which elucidates the inner workings of hidden neurons and their function space characteristics and establishes a connection to $\ell_0$-$\ell_1$ equivalence for neural networks analogous to the minimal cardinality solutions in compressed sensing.

Global Optimality in Neural Network Training

- Mathematics, Computer Science · 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
- 2017

Sufficient conditions are given to guarantee that local minima are globally optimal and that a local descent strategy can reach a global minimum from any initialization.

Vector-output ReLU Neural Network Problems are Copositive Programs: Convex Analysis of Two Layer Networks and Polynomial-time Algorithms

- Computer Science, Mathematics · ICLR
- 2021

This work demonstrates how neural networks implicitly attempt to solve copositive programs via semi-nonnegative matrix factorization, and describes the first algorithms for provably finding the global minimum of the vector output neural network training problem.

Breaking the Curse of Dimensionality with Convex Neural Networks

- Computer Science, Mathematics · J. Mach. Learn. Res.
- 2017

This work considers neural networks with a single hidden layer and non-decreasing homogeneous activation functions like the rectified linear units and shows that they are adaptive to unknown underlying linear structures, such as the dependence on the projection of the input variables onto a low-dimensional subspace.

On the Optimization of Deep Networks: Implicit Acceleration by Overparameterization

- Computer Science, Mathematics · ICML
- 2018

This paper suggests that, sometimes, increasing depth can speed up optimization and proves that it is mathematically impossible to obtain the acceleration effect of overparametrization via gradients of any regularizer.

Deep Neural Networks with Multi-Branch Architectures Are Intrinsically Less Non-Convex

- Computer Science · AISTATS
- 2019

This work provides strong guarantees on the duality gap for two classes of network architectures (neural networks with arbitrary activation functions, multi-branch architecture, and a variant of hinge loss), and shows that the duality gap of both population and empirical risks shrinks to zero as the number of branches increases.

Globally Optimal Gradient Descent for a ConvNet with Gaussian Inputs

- Mathematics, Computer Science · ICML
- 2017

This work provides the first global optimality guarantee of gradient descent on a convolutional neural network with ReLU activations, and shows that learning is NP-complete in the general case, but that when the input distribution is Gaussian, gradient descent converges to the global optimum in polynomial time.

On the Power of Over-parametrization in Neural Networks with Quadratic Activation

- Computer Science, Mathematics · ICML
- 2018

Even though the number of parameters may exceed the sample size, it is shown via Rademacher complexity theory that, with weight decay, the solution also generalizes well if the data is sampled from a regular distribution such as a Gaussian.