• Publications
Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1
TLDR
A binary matrix multiplication GPU kernel is written that runs the MNIST BNN 7 times faster than an unoptimized GPU kernel, with no loss in classification accuracy.
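As a rough illustration of why a binary kernel can be fast (this is not the paper's GPU kernel): for vectors with entries in {-1, +1}, the dot product reduces to an XNOR/popcount identity. The NumPy sketch below demonstrates that identity on unpacked 0/1 bit planes; a real kernel would pack 32 or 64 bits per machine word.

import numpy as np

def binarize(x):
    # Deterministic binarization: sign(x), with sign(0) mapped to +1.
    return np.where(x >= 0, 1, -1).astype(np.int8)

def xnor_popcount_matmul(A_pm1, B_pm1):
    """Binary matrix product via the XNOR/popcount identity.

    For a, b in {-1,+1}^n encoded as bits (+1 -> 1, -1 -> 0):
        a . b = 2 * popcount(XNOR(bits(a), bits(b))) - n
    Bit planes are kept as 0/1 uint8 arrays here purely for clarity.
    """
    n = A_pm1.shape[1]
    a_bits = (A_pm1 > 0).astype(np.uint8)            # shape (M, n)
    b_bits = (B_pm1 > 0).astype(np.uint8)            # shape (n, K)
    # XNOR == bit equality; summing matches over n is the popcount.
    matches = (a_bits[:, None, :] == b_bits.T[None, :, :]).sum(axis=2)
    return 2 * matches - n

# Tiny correctness check against an ordinary matrix product.
rng = np.random.default_rng(0)
A = binarize(rng.standard_normal((4, 16)))
B = binarize(rng.standard_normal((16, 3)))
assert np.array_equal(xnor_popcount_matmul(A, B), A.astype(int) @ B.astype(int))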
Quantized Neural Networks: Training Neural Networks with Low Precision Weights and Activations
TLDR
A binary matrix multiplication GPU kernel is programmed that runs the MNIST QNN 7 times faster than an unoptimized GPU kernel, with no loss in classification accuracy.
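For the general low-precision case, the sketch below shows generic uniform k-bit "fake" quantization of a tensor; the min/max clipping range and rounding mode are illustrative choices, not necessarily the exact scheme used in the paper.

import numpy as np

def uniform_quantize(x, num_bits=2, x_min=None, x_max=None):
    """Map x onto 2**num_bits evenly spaced levels, then de-quantize."""
    x_min = np.min(x) if x_min is None else x_min
    x_max = np.max(x) if x_max is None else x_max
    levels = 2 ** num_bits - 1
    scale = (x_max - x_min) / levels if x_max > x_min else 1.0
    q = np.clip(np.round((x - x_min) / scale), 0, levels)
    return q * scale + x_min          # "fake quantized" values for simulation

x = np.linspace(-1.0, 1.0, 9)
print(uniform_quantize(x, num_bits=2))  # 4 distinct levels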
The Implicit Bias of Gradient Descent on Separable Data
We examine gradient descent on unregularized logistic regression problems, with homogeneous linear predictors on linearly separable datasets. We show the predictor converges to the direction of the max-margin (hard-margin SVM) solution.
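For reference, the statement completed above can be written as a worked equation (informally, in the paper's setting of the logistic loss on linearly separable data, with $w(t)$ the gradient descent iterate):

\[
\lim_{t\to\infty} \frac{w(t)}{\lVert w(t)\rVert}
= \frac{\hat{w}}{\lVert \hat{w}\rVert},
\qquad
\hat{w} = \operatorname*{arg\,min}_{w} \lVert w\rVert_2^2
\quad \text{s.t.}\quad y_n\, w^{\top} x_n \ge 1 \;\;\forall n.
\]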
Train longer, generalize better: closing the generalization gap in large batch training of neural networks
TLDR
This work proposes a "random walk on a random landscape" statistical model, known to exhibit similar "ultra-slow" diffusion behavior, and presents a novel algorithm named "Ghost Batch Normalization" that significantly decreases the generalization gap without increasing the number of updates.
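As a hedged sketch of the Ghost Batch Normalization idea, the NumPy function below computes batch-norm statistics over small virtual ("ghost") batches inside one large batch; running statistics and learnable per-feature parameters are omitted for brevity, and the paper applies this inside the network's BN layers during training.

import numpy as np

def ghost_batch_norm(x, ghost_size, eps=1e-5, gamma=1.0, beta=0.0):
    """Normalize a (N, D) batch using per-ghost-batch mean and variance."""
    n = x.shape[0]
    assert n % ghost_size == 0, "batch size must divide into ghost batches"
    out = np.empty_like(x, dtype=float)
    for start in range(0, n, ghost_size):
        chunk = x[start:start + ghost_size]
        mu = chunk.mean(axis=0)        # statistics over the small virtual batch,
        var = chunk.var(axis=0)        # not over the full large batch
        out[start:start + ghost_size] = gamma * (chunk - mu) / np.sqrt(var + eps) + beta
    return out

x = np.random.default_rng(0).standard_normal((128, 8))   # large batch of 128
y = ghost_batch_norm(x, ghost_size=32)                    # 4 ghost batches of 32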
Characterizing Implicit Bias in Terms of Optimization Geometry
TLDR
This work explores whether the specific global minimum reached by an algorithm can be characterized in terms of the potential or norm of the optimization geometry, independently of hyperparameter choices such as step size and momentum.
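One representative result in this line of work, stated informally and omitting the paper's precise conditions: for steepest descent with respect to a norm $\lVert\cdot\rVert$ on exponential-tailed losses over separable data, the iterates converge in direction to a maximum-margin predictor for that norm, so it is the geometry of the update rather than the step size or momentum that selects the minimum:

\[
\bar{w} \;\in\; \operatorname*{arg\,min}_{w}\; \lVert w\rVert
\quad \text{s.t.}\quad y_n\, w^{\top} x_n \ge 1 \;\;\forall n.
\]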
Post training 4-bit quantization of convolutional networks for rapid-deployment
TLDR
This paper introduces the first practical 4-bit post-training quantization approach: it involves neither training the quantized model (fine-tuning) nor access to the full dataset, and it achieves accuracy just a few percent below the state-of-the-art baseline across a wide range of convolutional models.
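The sketch below shows generic per-channel post-training quantization of pretrained weights with no fine-tuning; the percentile-based clipping is an illustrative stand-in, not the paper's analytical clipping derivation.

import numpy as np

def post_training_quantize(w, num_bits=4, clip_pct=99.9):
    """Per-output-channel symmetric uniform quantization of a weight tensor."""
    flat = w.reshape(w.shape[0], -1)                       # (C_out, rest)
    clip = np.percentile(np.abs(flat), clip_pct, axis=1)   # per-channel range
    clip = np.maximum(clip, 1e-8)
    qmax = 2 ** (num_bits - 1) - 1                         # 7 for 4 bits
    scale = clip / qmax
    q = np.clip(np.round(flat / scale[:, None]), -qmax, qmax)
    return (q * scale[:, None]).reshape(w.shape)           # de-quantized weights

w = np.random.default_rng(0).standard_normal((16, 3, 3, 3))   # conv weights
w_q = post_training_quantize(w, num_bits=4)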
Implicit Bias of Gradient Descent on Linear Convolutional Networks
We show that gradient descent on full-width linear convolutional networks of depth $L$ converges to a linear predictor related to the $\ell_{2/L}$ bridge penalty in the frequency domain. This is in contrast to linear fully connected networks, where gradient descent converges to the $\ell_2$ maximum-margin solution regardless of depth.
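Stated informally, the limit predictor described above can be written as a frequency-domain bridge-penalty minimizer, where $\widehat{\beta}$ denotes the Fourier transform of the linear predictor $\beta$:

\[
\min_{\beta}\; \bigl\lVert \widehat{\beta} \bigr\rVert_{2/L}
\quad \text{s.t.}\quad y_n\, \beta^{\top} x_n \ge 1 \;\;\forall n,
\]

so depth $L=2$ corresponds to an $\ell_1$ penalty, i.e. sparsity in the frequency domain.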
Scalable Methods for 8-bit Training of Neural Networks
TLDR
This work is the first to quantize the weights, activations, and a substantial portion of the gradient stream in all layers (including batch normalization) to 8-bit, while showing state-of-the-art results on the ImageNet-1K dataset.
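One primitive commonly used when quantizing the gradient stream is stochastic rounding, which keeps the quantizer unbiased in expectation; the sketch below is a generic illustration, not the paper's exact scheme (the symmetric max-abs scaling is an assumption).

import numpy as np

def stochastic_round_quantize(x, num_bits=8):
    """Quantize x to signed num_bits integers with unbiased stochastic rounding."""
    qmax = 2 ** (num_bits - 1) - 1                  # 127 for 8 bits
    scale = np.max(np.abs(x)) / qmax + 1e-12
    scaled = x / scale
    noise = np.random.default_rng().uniform(0.0, 1.0, size=x.shape)
    q = np.clip(np.floor(scaled + noise), -qmax, qmax)   # rounds up w.p. frac(scaled)
    return q * scale

g = np.random.default_rng(1).standard_normal((64, 64)) * 1e-3   # mock gradients
g_q = stochastic_round_quantize(g)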
...