• Corpus ID: 235826162

# Global Optimality Beyond Two Layers: Training Deep ReLU Networks via Convex Programs

@inproceedings{Ergen2021GlobalOB,
title={Global Optimality Beyond Two Layers: Training Deep ReLU Networks via Convex Programs},
author={Tolga Ergen and Mert Pilanci},
booktitle={ICML},
year={2021}
}
• Published in ICML 11 October 2021
• Computer Science, Mathematics
Understanding the fundamental mechanism behind the success of deep neural networks is one of the key challenges in the modern machine learning literature. Despite numerous attempts, a solid theoretical analysis is yet to be developed. In this paper, we develop a novel unified framework to reveal a hidden regularization mechanism through the lens of convex optimization. We first show that the training of multiple three-layer ReLU sub-networks with weight decay regularization can be equivalently…
5 Citations

Path Regularization: A Convexity and Sparsity Inducing Regularization for Parallel ReLU Networks
• Computer Science, Mathematics
ArXiv
• 2021
Polynomial-time trainability of path-regularized ReLU networks with global optimality guarantees is proved, and the equivalent convex problem is shown to be regularized via a group-sparsity-inducing norm.
Efficient Global Optimization of Two-layer ReLU Networks: Quadratic-time Algorithms and Adversarial Training
• Yatong Bai
• Computer Science
• 2022
This work characterizes the quality of this approximation and develops two efficient algorithms that train ANNs with global convergence guarantees, solving unconstrained convex formulations and converging to an approximately globally optimal classifier.
Vector-output ReLU Neural Network Problems are Copositive Programs: Convex Analysis of Two Layer Networks and Polynomial-time Algorithms
• Computer Science, Mathematics
ICLR
• 2021
This work demonstrates how neural networks implicitly attempt to solve copositive programs via semi-nonnegative matrix factorization, and describes the first algorithms for provably finding the global minimum of the vector output neural network training problem.
The Convex Geometry of Backpropagation: Neural Network Gradient Flows Converge to Extreme Points of the Dual Convex Program
• Computer Science, Mathematics
ArXiv
• 2021
It is shown that the limit points of non-convex subgradient flows can be identified via primal-dual correspondence in this convex optimization problem, and a sufficient condition on the dual variables is derived which ensures that the stationary points of the non-convex objective are the KKT points of an equivalent convex objective, thus proving convergence of non-convex gradient flows to the global optimum.
Neural networks with linear threshold activations: structure and algorithms
• Computer Science
ArXiv
• 2021
This article precisely characterizes the class of functions that are representable by such neural networks and shows that 2 hidden layers are necessary and sufficient to represent any function representable in the class, a surprising result in the light of recent exact representability investigations for neural networks using other popular activation functions.

## References

SHOWING 1-10 OF 53 REFERENCES
Convex Duality of Deep Neural Networks
• Computer Science
ArXiv
• 2020
It is shown that a set of optimal hidden layer weight matrices for a norm regularized deep neural network training problem can be explicitly found as the extreme points of a convex set.
Convex Geometry of Two-Layer ReLU Networks: Implicit Autoencoding and Interpretable Models
• Computer Science
AISTATS
• 2020
A convex analytic framework for ReLU neural networks is developed which elucidates the inner workings of hidden neurons and their function space characteristics and establishes a connection to $\ell_0$-$\ell_1$ equivalence for neural networks analogous to the minimal cardinality solutions in compressed sensing.
Convex Geometry and Duality of Over-parameterized Neural Networks
• Computer Science, Mathematics
ArXiv
• 2020
A convex analytic framework for ReLU neural networks is developed which elucidates the inner workings of hidden neurons and their function space characteristics and establishes a connection to $\ell_0$-$\ell_1$ equivalence for neural networks analogous to the minimal cardinality solutions in compressed sensing.
Global Optimality in Neural Network Training
• Mathematics, Computer Science
2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
• 2017
Sufficient conditions are given to guarantee that local minima are globally optimal and that a local descent strategy can reach a global minimum from any initialization.
Vector-output ReLU Neural Network Problems are Copositive Programs: Convex Analysis of Two Layer Networks and Polynomial-time Algorithms
• Computer Science, Mathematics
ICLR
• 2021
This work demonstrates how neural networks implicitly attempt to solve copositive programs via semi-nonnegative matrix factorization, and describes the first algorithms for provably finding the global minimum of the vector output neural network training problem.
Breaking the Curse of Dimensionality with Convex Neural Networks
• F. Bach
• Computer Science, Mathematics
J. Mach. Learn. Res.
• 2017
This work considers neural networks with a single hidden layer and non-decreasing homogeneous activation functions like the rectified linear units and shows that they are adaptive to unknown underlying linear structures, such as the dependence on the projection of the input variables onto a low-dimensional subspace.
On the Optimization of Deep Networks: Implicit Acceleration by Overparameterization
• Computer Science, Mathematics
ICML
• 2018
This paper suggests that, sometimes, increasing depth can speed up optimization and proves that it is mathematically impossible to obtain the acceleration effect of overparametrization via gradients of any regularizer.
Deep Neural Networks with Multi-Branch Architectures Are Intrinsically Less Non-Convex
• Computer Science
AISTATS
• 2019
This work provides strong guarantees on the duality gap for two classes of network architectures; for neural networks with arbitrary activation functions, a multi-branch architecture, and a variant of hinge loss, it shows that the duality gap of both population and empirical risks shrinks to zero as the number of branches increases.
Globally Optimal Gradient Descent for a ConvNet with Gaussian Inputs
• Mathematics, Computer Science
ICML
• 2017
This work provides the first global optimality guarantee of gradient descent on a convolutional neural network with ReLU activations, and shows that learning is NP-complete in the general case, but that when the input distribution is Gaussian, gradient descent converges to the global optimum in polynomial time.
On the Power of Over-parametrization in Neural Networks with Quadratic Activation
• Computer Science, Mathematics
ICML
• 2018
Although the number of parameters may exceed the sample size, it is shown using Rademacher complexity theory that, with weight decay, the solution also generalizes well if the data is sampled from a regular distribution such as a Gaussian.