Publications
Exploring Generalization in Deep Learning
TLDR
This work considers several recently suggested explanations for what drives generalization in deep networks, including norm-based control, sharpness, and robustness, and investigates how these measures explain different observed phenomena.
Large Batch Optimization for Deep Learning: Training BERT in 76 minutes
TLDR
The empirical results demonstrate the superior performance of LAMB across various tasks such as BERT and ResNet-50 training with very little hyperparameter tuning, and the optimizer enables the use of very large batch sizes of 32868 without any degradation in performance.
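For readers unfamiliar with LAMB, the following is a minimal NumPy sketch of its layer-wise trust-ratio update as described in the paper; the function name `lamb_update` and the default hyperparameters here are illustrative choices, not the authors' exact training configuration.

```python
import numpy as np

def lamb_update(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
                eps=1e-6, weight_decay=0.01):
    """One LAMB-style step for a single layer's weights w with gradient g.

    m, v are the running first/second moment estimates (same shape as w),
    t is the 1-based step count. Returns the updated (w, m, v).
    """
    # Adam-style moment estimates with bias correction.
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)

    # Adam update direction plus decoupled weight decay.
    update = m_hat / (np.sqrt(v_hat) + eps) + weight_decay * w

    # Layer-wise trust ratio: scale the step by ||w|| / ||update||.
    w_norm = np.linalg.norm(w)
    u_norm = np.linalg.norm(update)
    trust_ratio = w_norm / u_norm if w_norm > 0 and u_norm > 0 else 1.0

    w = w - lr * trust_ratio * update
    return w, m, v
```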
A PAC-Bayesian Approach to Spectrally-Normalized Margin Bounds for Neural Networks
TLDR
A generalization bound for feedforward neural networks is presented in terms of the product of the spectral norm of the layers and the Frobenius norm of the weights using a PAC-Bayes analysis.
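As a rough illustration of the kind of quantity such a bound depends on, here is a small NumPy sketch computing a spectrally normalized complexity term from a list of layer weight matrices. The exact constants, depth and width factors, and margin and sample-size dependence from the paper are omitted, and the function `spectral_complexity` is purely illustrative.

```python
import numpy as np

def spectral_complexity(weights):
    """Illustrative spectrally normalized complexity term for a feedforward net.

    `weights` is a list of 2-D layer weight matrices. Returns the product of
    the layers' spectral norms scaled by the summed squared Frobenius-to-
    spectral ratios, which is the flavor of quantity the bound is stated in
    terms of (constants and margin factors omitted).
    """
    spec = [np.linalg.norm(W, 2) for W in weights]       # spectral norms
    frob = [np.linalg.norm(W, 'fro') for W in weights]   # Frobenius norms
    prod_spec = np.prod(spec)
    ratio_sum = sum((f / s) ** 2 for f, s in zip(frob, spec))
    return prod_spec * np.sqrt(ratio_sum)

# Example: a random 3-layer network.
rng = np.random.default_rng(0)
Ws = [rng.standard_normal((64, 32)), rng.standard_normal((32, 32)),
      rng.standard_normal((10, 32))]
print(spectral_complexity(Ws))
```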
Implicit Regularization in Matrix Factorization
TLDR
It is conjectured, with supporting theoretical evidence, that with small enough step sizes and initialization close enough to the origin, gradient descent on a full-dimensional factorization converges to the minimum nuclear norm solution.
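A toy NumPy sketch of this setup follows: gradient descent on a full-dimensional factorization X = UU^T of a matrix-sensing objective, with small initialization and step size. The problem sizes, step size, and iteration count are arbitrary illustrative choices, and a proper check of the conjecture would compare against the minimum nuclear norm solution from a convex solver, which is not done here.

```python
import numpy as np

rng = np.random.default_rng(0)
n, rank, m = 10, 1, 40        # dimension, true rank, number of measurements

# Low-rank PSD ground truth and random symmetric linear measurements.
G = rng.standard_normal((n, rank))
X_star = G @ G.T
A = [(M + M.T) / 2 for M in rng.standard_normal((m, n, n))]
y = np.array([np.sum(Ai * X_star) for Ai in A])

# Gradient descent on the full-dimensional factorization X = U U^T,
# started close to the origin with a small step size.
U = 1e-3 * rng.standard_normal((n, n))
step = 5e-3
for _ in range(4000):
    X = U @ U.T
    residuals = np.array([np.sum(Ai * X) for Ai in A]) - y
    grad_X = sum(r * Ai for r, Ai in zip(residuals, A)) / m
    U -= step * 2 * grad_X @ U    # d/dU of f(U U^T), using symmetric A_i

X_hat = U @ U.T
print("measurement fit error:",
      np.linalg.norm(np.array([np.sum(Ai * X_hat) for Ai in A]) - y))
print("nuclear norm of GD solution:", np.linalg.norm(X_hat, 'nuc'))
print("nuclear norm of ground truth:", np.linalg.norm(X_star, 'nuc'))
```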
Global Optimality of Local Search for Low Rank Matrix Recovery
TLDR
It is shown that there are no spurious local minima in the non-convex factorized parametrization of low-rank matrix recovery from incoherent linear measurements, which yields a polynomial time global convergence guarantee for stochastic gradient descent.
Dropping Convexity for Faster Semi-definite Optimization
TLDR
This is the first paper to provide precise convergence rate guarantees for general convex functions under standard convexity assumptions, and to provide a procedure to initialize FGD for (restricted) strongly convex objectives and for the case where one only has access to f via a first-order oracle.
Towards Understanding the Role of Over-Parametrization in Generalization of Neural Networks
TLDR
A novel complexity measure based on unit-wise capacities is presented, resulting in a tighter generalization bound for two-layer ReLU networks, along with a matching lower bound for the Rademacher complexity that improves over previous capacity lower bounds for neural networks.
Understanding Robustness of Transformers for Image Classification
TLDR
It is found that when pre-trained with a sufficient amount of data, ViT models are at least as robust as their ResNet counterparts on a broad range of perturbations, and that Transformers are robust to the removal of almost any single layer.
Completing any low-rank matrix, provably
TLDR
It is shown that any low-rank matrix can be exactly recovered from as few as $O(nr \log^2 n)$ randomly chosen elements, provided this random choice is made according to a specific biased distribution: the probability of any element being sampled should be proportional to the sum of the leverage scores of the corresponding row and column.
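The sampling distribution described above is straightforward to form once the leverage scores are known; below is a small NumPy sketch that builds it for a rank-r matrix and draws a set of observed entries. The recovery step itself (nuclear norm minimization over the observed entries) is omitted, and the helper name `leveraged_sampling_probs` is only for illustration.

```python
import numpy as np

def leveraged_sampling_probs(M, rank):
    """Entry-sampling probabilities proportional to the sum of the row and
    column leverage scores of a (low-) rank-`rank` matrix M."""
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    U, V = U[:, :rank], Vt[:rank, :].T
    row_lev = np.sum(U ** 2, axis=1)       # row leverage scores
    col_lev = np.sum(V ** 2, axis=1)       # column leverage scores
    P = row_lev[:, None] + col_lev[None, :]
    return P / P.sum()

# Example: sample on the order of n r log^2 n entries of a random
# rank-r matrix according to the leveraged distribution.
rng = np.random.default_rng(0)
n, r = 100, 3
M = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))
P = leveraged_sampling_probs(M, r)
m = int(n * r * np.log(n) ** 2)
idx = rng.choice(n * n, size=m, replace=False, p=P.ravel())
rows, cols = np.unravel_index(idx, (n, n))
print("sampled", m, "of", n * n, "entries")
```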
Universal Matrix Completion
TLDR
This work shows that if the set of sampled indices comes from the edges of a bipartite graph with a large spectral gap, then the nuclear norm minimization based method exactly recovers all low-rank matrices that satisfy certain incoherence properties.
...