Fitting ReLUs via SGD and Quantized SGD

Seyed Mohammadreza Mousavi Kalan, Mahdi Soltanolkotabi, Amir Salman Avestimehr
2019 IEEE International Symposium on Information Theory (ISIT)
In this paper we focus on the problem of finding the optimal weights of the shallowest of neural networks, consisting of a single Rectified Linear Unit (ReLU). These functions are of the form x → max(0, 〈w, x〉), with w ∈ ℝᵈ denoting the weight vector. We focus on a planted model where the inputs are chosen i.i.d. from a Gaussian distribution and the labels are generated according to a planted weight vector. We first show that mini-batch stochastic gradient descent, when suitably initialized…
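The planted-model setup the abstract describes can be sketched in a few lines. This is a minimal illustration, not the paper's exact algorithm or initialization; the dimensions, step size, and the name `w_star` are all illustrative choices.

```python
import numpy as np

# Illustrative sketch of the planted ReLU model: Gaussian inputs,
# labels y_i = max(0, <w_star, x_i>), fit by mini-batch SGD on squared loss.
rng = np.random.default_rng(0)
d, n, batch, lr, steps = 10, 2000, 64, 0.1, 1000

w_star = rng.standard_normal(d)        # planted (ground-truth) weight vector
X = rng.standard_normal((n, d))        # i.i.d. Gaussian inputs
y = np.maximum(0.0, X @ w_star)        # labels from the planted model

w = 0.01 * rng.standard_normal(d)      # small random initialization
for _ in range(steps):
    idx = rng.choice(n, size=batch, replace=False)
    Xb, yb = X[idx], y[idx]
    pred = np.maximum(0.0, Xb @ w)
    # (sub)gradient of 0.5 * mean((pred - y)^2) w.r.t. w
    grad = Xb.T @ ((pred - yb) * (Xb @ w > 0)) / batch
    w -= lr * grad

rel_err = np.linalg.norm(w - w_star) / np.linalg.norm(w_star)
print(rel_err)  # typically small: the iterates approach the planted weights
```

Because the labels are realizable (generated exactly by the planted ReLU), the residual at `w_star` is zero, which is what makes the linear convergence claimed in the abstract plausible in this regime.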


Overparameterized Nonlinear Learning: Gradient Descent Takes the Shortest Path?

This paper demonstrates the utility of the general theory of (stochastic) gradient descent for a variety of problem domains spanning low-rank matrix recovery to neural network training and develops novel martingale techniques that guarantee SGD never leaves a small neighborhood of the initialization, even with rather large learning rates.

Learning ReLU Networks on Linearly Separable Data: Algorithm, Optimality, and Generalization

This paper presents a novel stochastic gradient descent (SGD) algorithm, which can provably train any single-hidden-layer ReLU network to attain global optimality, despite the presence of infinitely many bad local minima, maxima, and saddle points in general.

A Study of Neural Training with Iterative Non-Gradient Methods

A simple stochastic algorithm is given that can train a ReLU gate in the realizable setting in linear time while using significantly milder conditions on the data distribution than previous results, and approximate recovery of the true label generating parameters is shown.

A Study of Neural Training with Non-Gradient and Noise Assisted Gradient Methods

This work demonstrates provable guarantees on the training of depth-2 neural networks in new regimes than previously explored and shows near-optimal guarantees of training a ReLU gate when an adversary is allowed to corrupt the true labels.

Understanding How Over-Parametrization Leads to Acceleration: A case of learning a single teacher neuron

It is provably shown that over-parametrization helps the iterates generated by gradient descent enter the neighborhood of a globally optimal solution achieving zero test error faster, and the work studies how the scaling of the output neurons affects the convergence time.

Learning a Single Neuron with Bias Using Gradient Descent

A detailed study of the fundamental problem of learning a single neuron with a bias term in the realizable setting with the ReLU activation, using gradient descent, characterizing the critical points of the objective, demonstrating failure cases, and providing positive convergence guarantees under different sets of assumptions.

ReLU Regression with Massart Noise

An efficient algorithm is developed that achieves exact parameter recovery in this model under mild anti-concentration assumptions on the underlying distribution, which are necessary for exact recovery to be information-theoretically possible.

An Approximation Algorithm for training One-Node ReLU Neural Network

An approximation algorithm is given to solve One-Node-ReLU whose running time is $\mathcal{O}(n^k)$, where $n$ is the number of samples and $k$ is a predefined integer constant; the algorithm requires no pre-processing or parameter tuning.

Learning a Single Neuron for Non-monotonic Activation Functions

This work establishes learnability of a single neuron x ↦ σ(wᵀx) with gradient descent (GD), without assuming monotonicity of σ, when the input distribution is the standard Gaussian, and shows that mild conditions on σ suffice to guarantee learnability in polynomial time with polynomially many samples.

Learning ReLUs via Gradient Descent

This paper shows that projected gradient descent, when initialized at 0, converges at a linear rate to the planted model with a number of samples that is optimal up to numerical constants.

Learning ReLU Networks via Alternating Minimization

It is shown that under standard distributional assumptions on the $d$-dimensional input data, the proposed algorithm provably recovers the true ground-truth parameters in a linearly convergent fashion, and its convergence to a global minimum is demonstrated empirically.

Convergence Analysis of Two-layer Neural Networks with ReLU Activation

A convergence analysis of SGD is provided for a rich subset of two-layer feedforward networks with ReLU activations, characterized by a special structure called an "identity mapping"; it proves that, if the input follows a Gaussian distribution, then with standard $O(1/\sqrt{d})$ initialization of the weights, SGD converges to the global minimum in a polynomial number of steps.

signSGD: compressed optimisation for non-convex problems

signSGD can get the best of both worlds: compressed gradients and an SGD-level convergence rate; the momentum counterpart of signSGD matches the accuracy and convergence speed of Adam on deep ImageNet models.
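The signSGD update transmits only the sign of each gradient coordinate. A minimal sketch on a toy quadratic objective (the objective and constants are illustrative, not from the paper):

```python
import numpy as np

# signSGD update on f(w) = 0.5 * ||w - target||^2:
# each step moves every coordinate by a fixed amount lr in the
# direction of -sign(grad), so only 1 bit per coordinate is needed.
rng = np.random.default_rng(1)
d, lr = 5, 0.05
target = rng.standard_normal(d)        # illustrative optimum

w = np.zeros(d)
for _ in range(200):
    grad = w - target                  # exact gradient of the quadratic
    w -= lr * np.sign(grad)            # update uses only the sign

# With a fixed step size, each coordinate ends up oscillating
# within lr of the optimum rather than converging exactly.
gap = np.max(np.abs(w - target))
```

The residual `gap` is bounded by the step size; in practice one decays `lr` to drive the iterates all the way to the optimum.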

Local Geometry of One-Hidden-Layer Neural Networks for Logistic Regression

This work proves that under Gaussian input, the empirical risk function employing quadratic loss exhibits strong convexity and smoothness uniformly in a local neighborhood of the ground truth, for a class of smooth activation functions satisfying certain properties, including sigmoid and tanh, as soon as the sample complexity is sufficiently large.

Learning One Convolutional Layer with Overlapping Patches

Convotron, the first provably efficient algorithm for learning a one-hidden-layer convolutional network with respect to a general class of (potentially overlapping) patches, is given, and it is proved that the framework captures commonly used schemes from computer vision, including one-dimensional and two-dimensional "patch and stride" convolutions.

Learning One-hidden-layer ReLU Networks via Gradient Descent

It is proved that tensor initialization followed by gradient descent can converge to the ground-truth parameters at a linear rate up to some statistical error.

Stochastic Gradient Descent Learns State Equations with Nonlinear Activations

It is proved that the SGD estimate converges linearly to the ground-truth weights using a near-optimal sample size, establishing a novel SGD convergence result with nonlinear activations.

QSGD: Communication-Efficient SGD via Gradient Quantization and Encoding

Quantized SGD is proposed, a family of compression schemes for gradient updates which provides convergence guarantees and leads to significant reductions in end-to-end training time, and can be extended to stochastic variance-reduced techniques.
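The core QSGD idea is stochastic quantization of gradient coordinates so that the compressed gradient remains unbiased. A sketch of the simplest one-level variant (each nonzero coordinate reduced to its sign, one Bernoulli bit, and a shared norm; a simplification of the paper's multi-level scheme):

```python
import numpy as np

def quantize_1level(g, rng):
    """One-level QSGD-style quantization: coordinate i becomes
    ||g|| * sign(g_i) * b_i with b_i ~ Bernoulli(|g_i| / ||g||),
    so E[Q(g)] = g (the quantizer is unbiased)."""
    norm = np.linalg.norm(g)
    if norm == 0.0:
        return np.zeros_like(g)
    p = np.abs(g) / norm                # per-coordinate keep probability
    bits = rng.random(g.shape) < p      # 1 random bit per coordinate
    return norm * np.sign(g) * bits

rng = np.random.default_rng(2)
g = rng.standard_normal(8)

# Empirical unbiasedness check: averaging many quantizations recovers g.
q_mean = np.mean([quantize_1level(g, rng) for _ in range(20000)], axis=0)
err = np.max(np.abs(q_mean - g))
```

Unbiasedness is what lets the usual SGD convergence analysis go through for the quantized updates, at the cost of extra gradient variance.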