# Reintroducing Straight-Through Estimators as Principled Methods for Stochastic Binary Networks

@inproceedings{Yanush2020ReintroducingSE, title={Reintroducing Straight-Through Estimators as Principled Methods for Stochastic Binary Networks}, author={Viktor Yanush and Alexander Shekhovtsov and Dmitry Molchanov and Dmitry P. Vetrov}, booktitle={German Conference on Pattern Recognition}, year={2020} }

Training neural networks with binary weights and activations is a challenging problem due to the lack of gradients and difficulty of optimization over discrete weights. Many successful experimental results have been recently achieved using the empirical straight-through estimation approach. This approach has generated a variety of ad-hoc rules for propagating gradients through non-differentiable activations and updating discrete weights. We put such methods on a solid basis by obtaining them as…
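As a rough illustration of the straight-through idea mentioned in the abstract, the sketch below samples a binary state in the forward pass and, in the backward pass, substitutes the derivative of the expected state for the undefined derivative of the sample. It is a minimal sketch under stated assumptions, not the paper's derivation: the logistic noise model, the ±1 state encoding, and the names `forward_sample` and `ste_backward` are all illustrative choices.

```python
import math
import random


def sigmoid(a):
    """Logistic function, used here as the assumed noise CDF."""
    return 1.0 / (1.0 + math.exp(-a))


def forward_sample(a, rng):
    """Sample a binary state x in {-1, +1} with P(x = +1) = sigmoid(a)."""
    return 1.0 if rng.random() < sigmoid(a) else -1.0


def ste_backward(a, grad_x):
    """Straight-through surrogate gradient with respect to the pre-activation a.

    The sampled state has no useful derivative, so the derivative of its
    expectation E[x] = 2 * sigmoid(a) - 1 is used instead (an assumption of
    this sketch): dE[x]/da = 2 * sigmoid(a) * (1 - sigmoid(a)).
    """
    p = sigmoid(a)
    return grad_x * 2.0 * p * (1.0 - p)
```

With a pre-activation of `0.0` and an upstream gradient of `1.0`, `ste_backward` returns `0.5`, while `forward_sample` still returns a hard ±1 value.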

## 10 Citations

### Initialization and Transfer Learning of Stochastic Binary Networks from Real-Valued Ones

- Computer Science
- 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)
- 2021

This work initializes SBNs from real-valued networks with ReLU activations and proposes that closely approximating their internal features can provide a good initialization for an SBN.

### Bias-Variance Tradeoffs in Single-Sample Binary Gradient Estimators

- Computer Science
- Lecture Notes in Computer Science
- 2021

A theoretical analysis of the bias and variance of several straight-through estimator methods is conducted in order to understand tradeoffs and verify the originally claimed properties.

### Understanding weight-magnitude hyperparameters in training binary networks

- Computer Science
- ArXiv
- 2023

Improved understanding is offered of how magnitude-based hyperparameters influence the training of binary networks, which allows for new optimization filters specifically designed for binary neural networks that are independent of their real-valued interpretation.

### S$^3$NN: Time step reduction of spiking surrogate gradients for training energy efficient single-step spiking neural

- Computer Science
- Neural Networks
- 2022

### AskewSGD : An Annealed interval-constrained Optimisation method to train Quantized Neural Networks

- Computer Science
- ArXiv
- 2022

A new algorithm, Annealed Skewed SGD (ASkewSGD), is proposed for training deep neural networks (DNNs) with quantized weights; it performs better than or on par with state-of-the-art methods on classical benchmarks.

### Binarizing Split Learning for Data Privacy Enhancement and Computation Reduction

- Computer Science
- ArXiv
- 2022

The binarized SL (B-SL) model can reduce privacy leakage from SL smashed data with only a small degradation in model accuracy and is promising for lightweight IoT/mobile applications with high privacy-preservation requirements, such as mobile healthcare applications.

### Bit-wise Training of Neural Network Weights

- Computer Science
- ArXiv
- 2022

We introduce an algorithm where the individual bits representing the weights of a neural network are learned. This method allows training weights with integer values on arbitrary bit-depths and…

### S2NN: Time Step Reduction of Spiking Surrogate Gradients for Training Energy Efficient Single-Step Neural Networks

- Computer Science
- ArXiv
- 2022

It is experimentally demonstrated that the obtained neuron model enables SNNs to train more accurately and energy-efficiently than existing neuron models for SNNs and BNNs, and that the proposed SNN can achieve accuracy comparable to full-precision networks while being highly energy-efficient.

### S$^3$NN: Time Step Reduction of Spiking Surrogate Gradients for Training Energy Efficient Single-Step Spiking Neural Networks

- Computer Science
- 2022

A single-step spiking neural network (S$^3$NN), an energy-efficient neural network with low computational cost and high precision, is proposed by reducing the surrogate gradient for multi-time step SNNs to a single time step.

### Structural Credit Assignment in Neural Networks using Reinforcement Learning

- Computer Science
- NeurIPS
- 2021

This work formalizes training a neural network as a finite-horizon reinforcement learning problem and introduces an off-policy approach to facilitate reasoning about the greedy action for other agents and to help overcome stochasticity in other agents.

## References

Showing 1-10 of 67 references

### Path Sample-Analytic Gradient Estimators for Stochastic Binary Networks

- Computer Science
- NeurIPS
- 2020

A new method is proposed for this estimation problem, combining sampling and analytic approximation steps. It has significantly reduced variance at the price of a small bias, which gives a very practical tradeoff in comparison with existing unbiased and biased estimators.

### Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation

- Computer Science
- ArXiv
- 2013

This work considers a small-scale version of *conditional computation*, where sparse stochastic units form a distributed representation of gaters that can turn off in combinatorially many ways large chunks of the computation performed in the rest of the neural network.

### Understanding Straight-Through Estimator in Training Activation Quantized Neural Nets

- Computer Science
- ICLR
- 2019

It is proved that if the STE is properly chosen, the expected coarse gradient correlates positively with the population gradient (not available for the training), and its negation is a descent direction for minimizing the population loss.

### Techniques for Learning Binary Stochastic Feedforward Neural Networks

- Computer Science
- ICLR
- 2015

This work confirms that training stochastic networks is difficult and proposes two new estimators that perform favorably among all the five known estimators, and proposes benchmark tests for comparing training algorithms.

### Straight-Through Estimator as Projected Wasserstein Gradient Flow

- Computer Science
- ArXiv
- 2019

It is shown that ST can be interpreted as the simulation of the projected Wasserstein gradient flow (pWGF), and a theoretical foundation is established to justify the convergence properties of ST.

### Adam: A Method for Stochastic Optimization

- Computer Science
- ICLR
- 2015

This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
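To make "adaptive estimates of lower-order moments" concrete, here is a minimal sketch of one Adam update for a single scalar parameter. The function name `adam_step` and the scalar-only setting are assumptions made for illustration; the default hyperparameters are the commonly quoted ones.

```python
import math


def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter.

    m and v are running estimates of the first and second moments of the
    gradient; t is the 1-based step count used for bias correction.
    """
    m = beta1 * m + (1.0 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1.0 - beta2) * grad * grad   # second-moment estimate
    m_hat = m / (1.0 - beta1 ** t)                # correct the bias toward zero
    v_hat = v / (1.0 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v
```

On the first step the bias-corrected ratio is close to the sign of the gradient, so the parameter moves by roughly `lr` regardless of the gradient's scale.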

### Training Discrete-Valued Neural Networks with Sign Activations Using Weight Distributions

- Computer Science
- ECML/PKDD
- 2019

This paper builds on the framework of probabilistic forward propagations using the local reparameterization trick, where the authors train a distribution over the NN weights instead of a single set of weights, and achieves state-of-the-art performance.

### ProxQuant: Quantized Neural Networks via Proximal Operators

- Computer Science
- ICLR
- 2019

This work proposes a more principled alternative approach, called ProxQuant, that formulates quantized network training as a regularized learning problem instead and optimizes it via the prox-gradient method, challenging the indispensability of the straight-through gradient method and providing a powerful alternative.
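The prox-gradient idea described above can be sketched for a single scalar weight. This is an illustrative sketch, not the ProxQuant implementation: it assumes a regularizer of the form `lam * |theta - sign(theta)|`, whose proximal step soft-thresholds the weight toward its nearest value in {-1, +1}, and the names `prox_binary_l1` and `prox_gradient_step` are hypothetical.

```python
import math


def sign(x):
    """Sign with the convention sign(0) = +1, so the target is always binary."""
    return 1.0 if x >= 0 else -1.0


def prox_binary_l1(theta, lam):
    """Proximal step for the assumed regularizer lam * |theta - sign(theta)|.

    Shrinks the distance between theta and its nearest binary value by lam,
    snapping to that value once the distance is at most lam.
    """
    target = sign(theta)
    d = theta - target
    shrunk = max(abs(d) - lam, 0.0)
    return target + math.copysign(shrunk, d)


def prox_gradient_step(theta, grad, lr, lam):
    """Gradient step on the loss followed by the proximal step on the regularizer."""
    return prox_binary_l1(theta - lr * grad, lr * lam)
```

For example, `prox_binary_l1(0.3, 0.2)` moves the weight from 0.3 to 0.5, and `prox_binary_l1(0.9, 0.2)` snaps it to 1.0. Increasing `lam` over training would push weights progressively closer to binary values.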

### Probabilistic Binary Neural Networks

- Computer Science
- ArXiv
- 2018

This work presents a probabilistic training method, called BLRNet, for neural networks with both binary weights and activations. It introduces stochastic versions of Batch Normalization and max pooling, which transfer well to a deterministic network at test time.

### Practical Variational Inference for Neural Networks

- Computer Science
- NIPS
- 2011

This paper introduces an easy-to-implement stochastic variational method (or equivalently, minimum description length loss function) that can be applied to most neural networks and revisits several common regularisers from a variational perspective.