# Reintroducing Straight-Through Estimators as Principled Methods for Stochastic Binary Networks

@inproceedings{Yanush2020ReintroducingSE,
title={Reintroducing Straight-Through Estimators as Principled Methods for Stochastic Binary Networks},
author={Viktor Yanush and Alexander Shekhovtsov and Dmitry Molchanov and Dmitry P. Vetrov},
booktitle={German Conference on Pattern Recognition},
year={2020}
}
• Published in German Conference on Pattern Recognition, 11 June 2020
• Computer Science
Training neural networks with binary weights and activations is a challenging problem due to the lack of gradients and difficulty of optimization over discrete weights. Many successful experimental results have been recently achieved using the empirical straight-through estimation approach. This approach has generated a variety of ad-hoc rules for propagating gradients through non-differentiable activations and updating discrete weights. We put such methods on a solid basis by obtaining them as…
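The straight-through approach described in the abstract can be illustrated with a minimal NumPy sketch (function names are illustrative, not from the paper): the non-differentiable sign is applied in the forward pass, while the backward pass treats it as the identity, zeroed where the input leaves the clipping interval (the common "hard-tanh" variant).

```python
import numpy as np

def binarize_forward(x):
    # Deterministic sign binarization (0 is mapped to +1).
    return np.where(x >= 0, 1.0, -1.0)

def ste_backward(x, grad_out, clip=1.0):
    # Straight-through estimator: pass the incoming gradient through
    # the sign as if it were the identity, masked to zero where
    # |x| exceeds the clipping threshold.
    return grad_out * (np.abs(x) <= clip)

x = np.array([-2.0, -0.5, 0.3, 1.5])
y = binarize_forward(x)               # [-1., -1., 1., 1.]
g = ste_backward(x, np.ones_like(x))  # [0., 1., 1., 0.]
```

In a real framework this pair of functions would be registered as a single custom autograd operation; the forward/backward split above only makes the rule explicit.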
## Citations (10)
• Computer Science
2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)
• 2021
This work initializes SBNs from real-valued networks with ReLU activations, and proposes that closely approximating their internal features can provide a good initialization for SBNs.
A theoretical analysis of the bias and variance of several straight-through estimator methods is conducted in order to understand their tradeoffs and verify the originally claimed properties.
• Computer Science
ArXiv
• 2023
Improved understanding is offered of how magnitude-based hyperparameters influence the training of binary networks, which allows for new optimization filters specifically designed for binary neural networks that are independent of their real-valued interpretation.
• Computer Science
ArXiv
• 2022
A new algorithm, Annealed Skewed SGD (ASkewSGD), for training deep neural networks (DNNs) with quantized weights, which performs better than or on par with state-of-the-art methods on classical benchmarks.
• Computer Science
ArXiv
• 2022
The binarized SL (B-SL) model can reduce privacy leakage from SL smashed data with merely a small degradation in model accuracy, and is promising for lightweight IoT/mobile applications with high privacy-preservation requirements, such as mobile healthcare applications.
We introduce an algorithm where the individual bits representing the weights of a neural network are learned. This method allows training weights with integer values on arbitrary bit-depths and…
• Computer Science
ArXiv
• 2022
It is experimentally demonstrated that the obtained neuron model enables SNNs to train more accurately and energy-efficiently than existing neuron models for SNNs and BNNs, and that the proposed SNN can achieve accuracy comparable to full-precision networks while being highly energy-efficient.
• Computer Science
• 2022
A single-step spiking neural network (S$^3$NN), an energy-efficient neural network with low computational cost and high precision, is proposed by reducing the surrogate gradient for multi-time-step SNNs to a single time step.
• Computer Science
NeurIPS
• 2021
This work formalizes training a neural network as a finite-horizon reinforcement learning problem and introduces an off-policy approach, to facilitate reasoning about the greedy action for other agents and help overcome stochasticity in other agents.

## References

Showing 1–10 of 67 references.

• Computer Science
NeurIPS
• 2020
A new method is proposed for this estimation problem, combining sampling and analytic approximation steps; it has significantly reduced variance at the price of a small bias, which gives a very practical tradeoff in comparison with existing unbiased and biased estimators.
• Computer Science
ArXiv
• 2013
This work considers a small-scale version of {\em conditional computation}, where sparse stochastic units form a distributed representation of gaters that can turn off in combinatorially many ways large chunks of the computation performed in the rest of the neural network.
• Computer Science
ICLR
• 2019
It is proved that if the STE is properly chosen, the expected coarse gradient correlates positively with the population gradient (which is not available during training), and its negation is a descent direction for minimizing the population loss.
• Computer Science
ICLR
• 2015
This work confirms that training stochastic networks is difficult, proposes two new estimators that perform favorably among the five estimators considered, and proposes benchmark tests for comparing training algorithms.
It is shown that ST can be interpreted as the simulation of the projected Wasserstein gradient flow (pWGF), and a theoretical foundation is established to justify the convergence properties of ST.
• Computer Science
ICLR
• 2015
This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
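The Adam update summarized here can be sketched in a few lines of NumPy (a simplified single-parameter illustration, not the reference implementation): first and second moments of the gradient are tracked with exponential moving averages, bias-corrected, and combined into an adaptive step.

```python
import numpy as np

def adam_step(theta, grad, m, v, t,
              lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    # Exponential moving averages of the gradient and its square.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Bias correction for the zero initialization of m and v.
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

theta = np.array([5.0])
m = v = np.zeros_like(theta)
for t in range(1, 501):
    grad = 2 * theta  # gradient of f(theta) = theta**2
    theta, m, v = adam_step(theta, grad, m, v, t)
# theta approaches the minimum at 0
```

The regret-bound analysis mentioned in the summary concerns this same update applied online to convex losses.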
• Computer Science
ECML/PKDD
• 2019
This paper builds on the framework of probabilistic forward propagation using the local reparameterization trick: instead of training a single set of NN weights, the authors train a distribution over these weights, and achieve state-of-the-art performance.
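The local reparameterization trick mentioned above can be sketched as follows (a NumPy illustration under the usual Gaussian weight-posterior assumption; function and variable names are illustrative): rather than sampling a weight matrix and multiplying, one samples the layer's pre-activations directly from the Gaussian they induce, which reduces gradient variance.

```python
import numpy as np

rng = np.random.default_rng(0)

def lrt_linear(x, w_mu, w_logvar):
    # Local reparameterization: for W ~ N(w_mu, exp(w_logvar))
    # elementwise, the pre-activations of a linear layer are
    # Gaussian with mean x @ w_mu and variance (x**2) @ exp(w_logvar),
    # so we sample them directly instead of sampling W.
    act_mu = x @ w_mu
    act_var = (x ** 2) @ np.exp(w_logvar)
    eps = rng.standard_normal(act_mu.shape)
    return act_mu + np.sqrt(act_var) * eps

x = np.ones((4, 3))
w_mu = np.zeros((3, 2))
w_logvar = np.zeros((3, 2))
out = lrt_linear(x, w_mu, w_logvar)  # shape (4, 2)
```

Sampling per-example pre-activation noise, rather than one weight sample shared across the minibatch, is what decorrelates the gradients between examples.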
• Computer Science
ICLR
• 2019
This work proposes a more principled alternative approach, called ProxQuant, that formulates quantized network training as a regularized learning problem instead and optimizes it via the prox-gradient method, challenging the indispensability of the straight-through gradient method and providing a powerful alternative.
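One prox-gradient step of the kind ProxQuant describes might look as follows (an illustrative NumPy sketch assuming a distance-to-{−1, +1} regularizer; function names are hypothetical): an ordinary gradient step is followed by a proximal map that pulls each weight toward its nearest binary value by at most a regularization amount.

```python
import numpy as np

def prox_binary(theta, lam):
    # Proximal map of lam * |theta - q(theta)|, where q(theta) is the
    # nearest binary value in {-1, +1}: move each weight toward its
    # nearest quantization point by at most lam, never past it.
    target = np.where(theta >= 0, 1.0, -1.0)
    diff = target - theta
    return theta + np.sign(diff) * np.minimum(np.abs(diff), lam)

def proxquant_step(theta, grad, lr, lam):
    # One prox-gradient iteration: SGD update, then the prox map.
    return prox_binary(theta - lr * grad, lam)

theta = np.array([-1.8, -0.2, 0.4, 1.1])
theta_prox = prox_binary(theta, 0.5)  # [-1.3, -0.7, 0.9, 1.0]
```

Annealing `lam` upward over training drives the weights to exactly binary values, in contrast to the straight-through heuristic.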
• Computer Science
ArXiv
• 2018
This work presents a probabilistic training method for neural networks with both binary weights and activations, called BLRNet, which introduces stochastic versions of batch normalization and max pooling that transfer well to a deterministic network at test time.
This paper introduces an easy-to-implement stochastic variational method (or equivalently, minimum description length loss function) that can be applied to most neural networks and revisits several common regularisers from a variational perspective.