Publications
A Convergence Theory for Deep Learning via Over-Parameterization
TLDR
This work proves that stochastic gradient descent can find global minima of the training objective of DNNs in polynomial time, and implies an equivalence between over-parameterized neural networks and the neural tangent kernel (NTK) in the finite (and polynomial) width setting.
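For context, and in generic notation rather than the paper's own: the neural tangent kernel of a network $f(x;\theta)$ at initialization $\theta_0$ is $\Theta(x, x') = \langle \nabla_\theta f(x;\theta_0),\, \nabla_\theta f(x';\theta_0) \rangle$, and equivalence statements of this kind say that, at sufficiently large (polynomial) width, (S)GD training of the network closely tracks kernel regression with $\Theta$.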
Learning and Generalization in Overparameterized Neural Networks, Going Beyond Two Layers
TLDR
It is proved that overparameterized neural networks can learn some notable concept classes, including two- and three-layer networks with fewer parameters and smooth activations, and that this learning can be done by SGD (stochastic gradient descent) or its variants in polynomial time using polynomially many samples.
A Latent Variable Model Approach to PMI-based Word Embeddings
TLDR
A new generative model is proposed: a dynamic version of the log-linear topic model of Mnih and Hinton (2007), whose prior yields closed-form expressions for word statistics; it is also shown that the latent word vectors are fairly uniformly dispersed in space.
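The headline relation in this line of work, stated informally and not in the paper's exact notation, ties pointwise mutual information to inner products of the latent vectors: $\mathrm{PMI}(w, w') = \log \frac{p(w, w')}{p(w)\,p(w')} \approx \frac{\langle v_w, v_{w'} \rangle}{d}$, where $v_w \in \mathbb{R}^d$ is the latent vector of word $w$ and the approximation holds up to a small error under the model's assumptions.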
Learning Overparameterized Neural Networks via Stochastic Gradient Descent on Structured Data
TLDR
It is proved that, when the data comes from mixtures of well-separated distributions, SGD learns a network with small generalization error even though the network has enough capacity to fit arbitrary labels.
A Theoretical Analysis of NDCG Ranking Measures
TLDR
This paper studies, from a theoretical perspective, the Normalized Discounted Cumulative Gain (NDCG), a family of ranking measures widely used in practice, and shows that standard NDCG retains consistent distinguishability even though it converges to the same limit for all ranking functions.
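For reference, using one standard form of the measure rather than anything specific to this paper: for a ranked list with relevance grades $r_1, \dots, r_n$, $\mathrm{DCG}@k = \sum_{i=1}^{k} \frac{2^{r_i} - 1}{\log_2(i + 1)}$ and $\mathrm{NDCG}@k = \mathrm{DCG}@k \,/\, \mathrm{IDCG}@k$, where $\mathrm{IDCG}@k$ is the DCG of the ideal (relevance-sorted) ordering; the logarithmic discount $1/\log_2(i+1)$ is the object whose distinguishability properties are analyzed.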
Algorithmic Framework for Model-based Reinforcement Learning with Theoretical Guarantees
TLDR
A novel algorithmic framework for designing and analyzing model-based RL algorithms with theoretical guarantees is introduced, and a meta-algorithm is designed with a theoretical guarantee of monotone improvement to a local maximum of the expected reward.
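Schematically, and as a paraphrase of the general minorize-maximize structure behind such guarantees rather than the paper's exact notation: each iteration jointly optimizes the policy and the model against a lower bound on the true return, $(\pi_{k+1}, M_{k+1}) \in \arg\max_{\pi, M} \; V^{\pi}_{M} - D_{\pi_k}(M)$, where $V^{\pi}_{M}$ is the return of policy $\pi$ in model $M$ and $D_{\pi_k}(M)$ is a discrepancy term penalizing model error under data from the current policy $\pi_k$; monotone improvement follows because the objective lower-bounds the true return for every $\pi$ and is tight at the current iterate.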
LoRA: Low-Rank Adaptation of Large Language Models
TLDR
Low-Rank Adaptation, or LoRA, is proposed, which freezes the pre-trained model weights and injects trainable rank decomposition matrices into each layer of the Transformer architecture, greatly reducing the number of trainable parameters for downstream tasks.
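The mechanism is simple enough to sketch in a few lines. The PyTorch snippet below is an illustrative sketch rather than the reference implementation; the class name, initialization scale, and the default r and alpha values are assumptions made here for the example.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    # Frozen pre-trained weight W plus a trainable low-rank update B @ A,
    # scaled by alpha / r, as described in the TLDR above.
    def __init__(self, in_features: int, out_features: int, r: int = 8, alpha: float = 16.0):
        super().__init__()
        # Pre-trained weight: kept frozen during fine-tuning (random here for the sketch).
        self.weight = nn.Parameter(torch.randn(out_features, in_features), requires_grad=False)
        # Rank decomposition: A maps down to rank r, B maps back up; B starts at zero
        # so the layer initially matches the frozen model exactly.
        self.lora_A = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, r))
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen path plus the scaled low-rank correction.
        return x @ self.weight.T + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling

# Usage sketch: only the rank-r factors receive gradients.
layer = LoRALinear(in_features=768, out_features=768, r=8)
x = torch.randn(4, 768)
y = layer(x)
print([name for name, p in layer.named_parameters() if p.requires_grad])  # ['lora_A', 'lora_B']

After fine-tuning, the low-rank update can be merged back into the frozen weight (adding scaling * lora_B @ lora_A to weight), so inference incurs no additional latency.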
A Theoretical Analysis of NDCG Type Ranking Measures
TLDR
This paper studies, from a theoretical perspective, the widely used Normalized Discounted Cumulative Gain (NDCG)-type ranking measures, and shows that NDCG with logarithmic discount retains consistent distinguishability even though it converges to the same limit for all ranking functions.
Algorithmic Regularization in Over-parameterized Matrix Sensing and Neural Networks with Quadratic Activations
TLDR
The gradient descent algorithm provides an implicit regularization effect in the learning of over-parameterized matrix factorization models and one-hidden-layer neural networks with quadratic activations, and the results resolve the conjecture of Gunasekar et al.
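Concretely, in generic notation rather than the paper's: the over-parameterized matrix sensing problem minimizes $f(U) = \tfrac{1}{2} \sum_{i=1}^{m} \big( \langle A_i, U U^{\top} \rangle - b_i \big)^2$ over $U \in \mathbb{R}^{d \times d}$, where the ground truth $M^{\star}$ has rank $r \ll d$; the implicit-regularization phenomenon is that gradient descent started from a sufficiently small random initialization converges toward the low-rank solution even though the over-parameterized objective also has global minima that merely fit the measurements.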
Neon2: Finding Local Minima via First-Order Oracles
We propose a reduction for non-convex optimization that can (1) turn a stationary-point-finding algorithm into a local-minimum-finding one, and (2) replace Hessian-vector product computations with only gradient computations.
...
...