• Corpus ID: 211171365

The Geometry of Sign Gradient Descent

Lukas Balles, Fabian Pedregosa, Nicolas Le Roux
Sign-based optimization methods have become popular in machine learning due to their favorable communication cost in distributed optimization and their surprisingly good performance in neural network training. Furthermore, they are closely connected to so-called adaptive gradient methods like Adam. Recent works on signSGD have used a non-standard "separable smoothness" assumption, whereas some older works study sign gradient descent as steepest descent with respect to the $\ell_\infty$-norm. In… 
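The steepest-descent view mentioned in the abstract can be illustrated with a minimal sketch. It relies on the standard fact that minimizing $\langle g, d\rangle + \tfrac{L}{2}\|d\|_\infty^2$ over $d$ gives $d = -(\|g\|_1 / L)\,\mathrm{sign}(g)$. The function names, the toy quadratic, and the smoothness constant below are illustrative assumptions and are not taken from the paper.

```python
def sign(v):
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def sign_gd_step(x, grad, smoothness):
    """One steepest-descent step with respect to the l-infinity norm.

    The update is x - (||g||_1 / L) * sign(g): every coordinate moves by
    the same magnitude, set by the l1-norm of the gradient (the dual norm).
    """
    g = grad(x)
    l1_norm = sum(abs(gi) for gi in g)
    step = l1_norm / smoothness
    return [xi - step * sign(gi) for xi, gi in zip(x, g)]


# Hypothetical toy objective: f(x) = 0.5 * (x0^2 + 4 * x1^2).
def toy_f(x):
    return 0.5 * (x[0] ** 2 + 4.0 * x[1] ** 2)


def toy_grad(x):
    return [x[0], 4.0 * x[1]]


# For this diagonal quadratic, the smoothness constant with respect to the
# l-infinity norm is the sum of the Hessian diagonal entries: 1 + 4 = 5.
TOY_SMOOTHNESS = 5.0

if __name__ == "__main__":
    x = [2.0, -1.0]
    for t in range(5):
        x = sign_gd_step(x, toy_grad, TOY_SMOOTHNESS)
        print(t, x, toy_f(x))
```

Note that the smoothness constant here is measured in the $\ell_\infty$-norm, so it is generally larger than the usual Euclidean constant; in this sketch it is hand-computed for the toy problem rather than estimated.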

On Faster Convergence of Scaled Sign Gradient Descent

This paper investigates faster convergence for a variant of sign-based gradient descent, called scaled SIGNGD, in three cases: 1) the objective function is strongly convex; 2) the objective function is nonconvex but satisfies the Polyak-Łojasiewicz (PL) inequality; 3) the gradient is stochastic, in which case the method is called scaled SIGNSGD.

Nonlinear gradient mappings and stochastic optimization: A general framework with applications to heavy-tail noise

Experiments show that, while the proposed framework is more general than existing studies of SGD under heavy-tail noise, several easy-to-implement nonlinearities from the framework are competitive with state-of-the-art alternatives on real data sets with heavy-tail noise.

First-Order Optimization Inspired from Finite-Time Convergent Flows

This paper proposes an Euler discretization for these rescaled-gradient and signed-gradient optimization algorithms, provides convergence guarantees in the deterministic and the stochastic setting, and shows that the resulting schemes converge faster than standard optimization alternatives.

Revealing and Protecting Labels in Distributed Training

This work proposes a method to discover the set of labels of training samples from only the gradient of the last layer and the ID-to-label mapping, and demonstrates the effectiveness of this method for model training in two domains: image classification and automatic speech recognition.



Hard to Forget: Poisoning Attacks on Certified Machine Unlearning

It is demonstrated how an attacker can exploit this oversight, highlighting a novel attack surface introduced by machine unlearning, and an attacker aiming to increase the computational cost of data removal is considered.

Online Training of Spiking Recurrent Neural Networks with Phase-Change Memory Synapses

A simulation framework of differential-architecture crossbar arrays based on an accurate and comprehensive Phase Change Memory (PCM) device model is presented and it is demonstrated that accumulating gradients can enable online and efficient training of spiking RNNs on memristive substrates.



signSGD: compressed optimisation for non-convex problems

SignSGD can get the best of both worlds: compressed gradients and SGD-level convergence rate, and the momentum counterpart of signSGD is able to match the accuracy and convergence speed of Adam on deep ImageNet models.

Adam: A Method for Stochastic Optimization

This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.

Entropy-SGD: Biasing Gradient Descent Into Wide Valleys

This paper proposes a new optimization algorithm called Entropy-SGD for training deep neural networks that is motivated by the local geometry of the energy landscape and compares favorably to state-of-the-art techniques in terms of generalization error and training time.

On Stochastic Sign Descent Methods

This paper performs a general analysis of sign-based methods for non-convex optimization and assures exponentially fast variance reduction with respect to number of nodes, maintaining 1-bit compression in both directions and using small mini-batch sizes.

Why Gradient Clipping Accelerates Training: A Theoretical Justification for Adaptivity

It is shown that gradient smoothness, a concept central to the analysis of first-order optimization algorithms that is often assumed to be a constant, demonstrates significant variability along the training trajectory of deep neural networks and positively correlates with the gradient norm, contrary to standard assumptions in the literature.

Linear Convergence of Gradient and Proximal-Gradient Methods Under the Polyak-Łojasiewicz Condition

This work shows that this much-older Polyak-Łojasiewicz (PL) inequality is actually weaker than the main conditions that have been explored to show linear convergence rates without strong convexity over the last 25 years, leading to simple proofs of linear convergence of these methods.

Beyond Convexity: Stochastic Quasi-Convex Optimization

This paper analyzes a stochastic version of NGD and proves its convergence to a global minimum for a wider class of functions: it requires the functions to be quasi-convex and locally-Lipschitz.


Normalized gradient methods that use a constant step size with occasional decay, such as SGD with momentum, perform better in deep convolutional neural networks, while those with adaptive step sizes perform better in recurrent neural networks.

Stochastic Spectral Descent for Discrete Graphical Models

A new, largely tuning-free algorithm that derives novel majorization bounds based on the Schatten-∞ norm and demonstrates empirically that this algorithm leads to dramatically faster training and improved predictive ability compared to stochastic gradient descent for both directed and undirected graphical models.

Hessian based analysis of SGD for Deep Nets: Dynamics and Generalization

New empirical observations and theoretical results on both the optimization dynamics and generalization behavior of SGD for deep nets based on the Hessian of the training loss and associated quantities are presented.