Corpus ID: 235826162

Global Optimality Beyond Two Layers: Training Deep ReLU Networks via Convex Programs

@inproceedings{Ergen2021GlobalOB,
  title={Global Optimality Beyond Two Layers: Training Deep ReLU Networks via Convex Programs},
  author={Tolga Ergen and Mert Pilanci},
  booktitle={ICML},
  year={2021}
}
Understanding the fundamental mechanism behind the success of deep neural networks is one of the key challenges in the modern machine learning literature. Despite numerous attempts, a solid theoretical analysis is yet to be developed. In this paper, we develop a novel unified framework to reveal a hidden regularization mechanism through the lens of convex optimization. We first show that the training of multiple three-layer ReLU sub-networks with weight decay regularization can be equivalently…
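The abstract refers to an equivalent convex program for weight-decay regularized ReLU network training. For context, below is a minimal sketch (not code from this paper) of the previously known two-layer convex reformulation that this work extends beyond two layers: squared loss with a group $\ell_1$ penalty over ReLU activation patterns, subject to polyhedral cone constraints. The problem sizes, variable names, random subsampling of activation patterns, and the use of cvxpy are illustrative assumptions.

# Minimal sketch of the known two-layer convex reformulation (assumed setup).
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
n, d, beta = 20, 3, 1e-2
X = rng.standard_normal((n, d))          # data matrix
y = rng.standard_normal(n)               # regression targets

# Subsample hyperplane arrangement patterns D_i = diag(1[X g >= 0]).
G = rng.standard_normal((d, 50))
patterns = np.unique((X @ G >= 0).astype(float), axis=1).T   # one pattern per row

# Convex program:
#   min 0.5 * || sum_i D_i X (v_i - u_i) - y ||^2 + beta * sum_i (||v_i||_2 + ||u_i||_2)
#   s.t. (2 D_i - I) X v_i >= 0,  (2 D_i - I) X u_i >= 0  for all i
V = [cp.Variable(d) for _ in patterns]
U = [cp.Variable(d) for _ in patterns]
fit = sum(cp.multiply(p, X @ (v - u)) for p, v, u in zip(patterns, V, U)) - y
reg = sum(cp.norm(v, 2) + cp.norm(u, 2) for v, u in zip(V, U))
constraints = []
for p, v, u in zip(patterns, V, U):
    S = np.diag(2.0 * p - 1.0)           # (2 D_i - I)
    constraints += [S @ (X @ v) >= 0, S @ (X @ u) >= 0]

problem = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(fit) + beta * reg), constraints)
problem.solve()
print("optimal convex objective:", problem.value)

The group norm on (v_i, u_i) plays the role of weight decay in the original non-convex problem, and each pattern D_i corresponds to one ReLU activation region of the data.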

Citations

Path Regularization: A Convexity and Sparsity Inducing Regularization for Parallel ReLU Networks
TLDR
Polynomial-time trainability of path-regularized ReLU networks with global optimality guarantees is proved, and the equivalent convex problem is regularized via a group sparsity inducing norm.
Fast Convex Optimization for Two-Layer ReLU Networks: Equivalent Model Classes and Cone Decompositions
TLDR
This work leverages a convex reformulation of the standard weight-decay penalized training problem as a set of group $\ell_1$-regularized data-local models, where locality is enforced by polyhedral cone constraints.
Efficient Global Optimization of Two-layer ReLU Networks: Quadratic-time Algorithms and Adversarial Training
TLDR
This work characterizes the quality of this approximation and develops two efficient algorithms that train ANNs with global convergence guarantees by solving unconstrained convex formulations, converging to an approximately globally optimal classifier.
Vector-output ReLU Neural Network Problems are Copositive Programs: Convex Analysis of Two Layer Networks and Polynomial-time Algorithms
TLDR
This work demonstrates how neural networks implicitly attempt to solve copositive programs via semi-nonnegative matrix factorization, and describes the first algorithms for provably finding the global minimum of the vector output neural network training problem.
Hidden Convexity of Wasserstein GANs: Interpretable Generative Models with Closed-Form Solutions
TLDR
This work analyzes the training of Wasserstein GANs with two-layer neural network discriminators through the lens of convex duality and, for a variety of generators, exposes the conditions under which Wasserstein GANs can be solved exactly with convex optimization approaches or represented as convex-concave games.
Practical Convex Formulations of One-hidden-layer Neural Network Adversarial Training
TLDR
It is proved that a stochastic approximation procedure that scales linearly yields high-quality solutions and can globally optimize neural networks; the procedure is derived from convex optimization models that efficiently perform adversarial training.
The Convex Geometry of Backpropagation: Neural Network Gradient Flows Converge to Extreme Points of the Dual Convex Program
TLDR
It is shown that the limit points of non-convex subgradient flows can be identified via primal-dual correspondence in this convex optimization problem, and a sufficient condition on the dual variables is derived which ensures that the stationary points of the non-convex objective are the KKT points of an equivalent convex objective, thus proving convergence of non-convex gradient flows to the global optimum.
Neural networks with linear threshold activations: structure and algorithms
TLDR
This article precisely characterizes the class of functions that are representable by such neural networks and shows that two hidden layers are necessary and sufficient to represent any function representable in the class, a surprising result in light of recent exact representability investigations for neural networks using other popular activation functions.
Demystifying Batch Normalization in ReLU Networks: Equivalent Convex Optimization Models and Implicit Regularization
TLDR
An analytic framework based on convex duality is introduced to obtain exact convex representations of weight-decay regularized ReLU networks with BN, which can be trained in polynomial time; it is shown that optimal layer weights can be obtained as simple closed-form formulas in the high-dimensional and/or overparameterized regimes.

References

Showing 1-10 of 53 references
Convex Duality of Deep Neural Networks
TLDR
It is shown that a set of optimal hidden layer weight matrices for a norm regularized deep neural network training problem can be explicitly found as the extreme points of a convex set.
Convex Geometry of Two-Layer ReLU Networks: Implicit Autoencoding and Interpretable Models
TLDR
A convex analytic framework for ReLU neural networks is developed which elucidates the inner workings of hidden neurons and their function space characteristics and establishes a connection to $\ell_0$-$\ell_1$ equivalence for neural networks analogous to the minimal cardinality solutions in compressed sensing.
Convex Geometry and Duality of Over-parameterized Neural Networks
TLDR
A convex analytic framework for ReLU neural networks is developed which elucidates the inner workings of hidden neurons and their function space characteristics and establishes a connection to $\ell_0$-$\ell_1$ equivalence for neural networks analogous to the minimal cardinality solutions in compressed sensing.
Global Optimality in Neural Network Training
  • B. Haeffele, R. Vidal
  • Computer Science
    2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
  • 2017
TLDR
Sufficient conditions are given to guarantee that local minima are globally optimal and that a local descent strategy can reach a global minimum from any initialization.
Vector-output ReLU Neural Network Problems are Copositive Programs: Convex Analysis of Two Layer Networks and Polynomial-time Algorithms
TLDR
This work demonstrates how neural networks implicitly attempt to solve copositive programs via semi-nonnegative matrix factorization, and describes the first algorithms for provably finding the global minimum of the vector output neural network training problem.
Breaking the Curse of Dimensionality with Convex Neural Networks
  • F. Bach
  • Computer Science
    J. Mach. Learn. Res.
  • 2017
TLDR
This work considers neural networks with a single hidden layer and non-decreasing homogeneous activation functions like the rectified linear units and shows that they are adaptive to unknown underlying linear structures, such as the dependence on the projection of the input variables onto a low-dimensional subspace.
On the Optimization of Deep Networks: Implicit Acceleration by Overparameterization
TLDR
This paper suggests that, sometimes, increasing depth can speed up optimization and proves that it is mathematically impossible to obtain the acceleration effect of overparametrization via gradients of any regularizer.
Deep Neural Networks with Multi-Branch Architectures Are Intrinsically Less Non-Convex
TLDR
This work provides strong guarantees on this quantity for two classes of network architectures, namely neural networks with arbitrary activation functions, multi-branch architecture, and a variant of hinge loss, and shows that the duality gap of both the population and empirical risks shrinks to zero as the number of branches increases.
Globally Optimal Gradient Descent for a ConvNet with Gaussian Inputs
TLDR
This work provides the first global optimality guarantee of gradient descent on a convolutional neural network with ReLU activations, and shows that learning is NP-complete in the general case, but that when the input distribution is Gaussian, gradient descent converges to the global optimum in polynomial time.
On the Power of Over-parametrization in Neural Networks with Quadratic Activation
TLDR
Although the number of parameters may exceed the sample size, it is shown using the theory of Rademacher complexity that, with weight decay, the solution also generalizes well if the data is sampled from a regular distribution such as a Gaussian.