Adaptive Optimizers with Sparse Group Lasso for Neural Networks in CTR Prediction

Authors: Yun Yue, Yongchao Liu, Suo Tong, Minghao Li, Zhen Zhang, Chunyang Wen, Huanjun Bao, Lihong Gu, Jinjie Gu, Yixiang Mu
We develop a novel framework that adds the regularizers of the sparse group lasso to a family of adaptive optimizers in deep learning, such as Momentum, Adagrad, Adam, AMSGrad, and AdaHessian, creating a new class of optimizers named Group Momentum, Group Adagrad, Group Adam, Group AMSGrad, Group AdaHessian, etc., accordingly. We establish theoretically proven convergence guarantees in the stochastic convex settings, based on primal-dual methods. We evaluate the…
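
To make the regularizer concrete, the following is a minimal sketch of a generic proximal step for a sparse group lasso penalty on a single group of weights (for example, one embedding vector). It is not the closed-form update used by the Group optimizers in the paper; the function names and the parameters `lam_l1` and `lam_group` are assumptions chosen for illustration.

```python
import math

def soft_threshold(x, lam):
    """Shrink a scalar toward zero by lam (the l1 part of the penalty)."""
    return math.copysign(max(abs(x) - lam, 0.0), x)

def sparse_group_lasso_prox(group, lam_l1, lam_group):
    """Proximal step for lam_l1 * ||w||_1 + lam_group * ||w||_2 on one group.

    Hypothetical helper: `group` is a list of floats holding one group of weights.
    """
    # Step 1: the l1 term zeroes out individually small weights.
    shrunk = [soft_threshold(w, lam_l1) for w in group]
    # Step 2: the group term rescales the whole group, or removes it entirely.
    norm = math.sqrt(sum(w * w for w in shrunk))
    if norm <= lam_group:
        return [0.0] * len(group)
    scale = 1.0 - lam_group / norm
    return [scale * w for w in shrunk]
```

A group whose norm falls below `lam_group` after the elementwise shrinkage is set to zero as a whole, which is what lets this kind of penalty drop entire features rather than scattered coordinates.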



Online Learning for Group Lasso

A novel online learning algorithm for the group lasso that scales well: at each iteration, the weight vector is updated using a closed-form solution based on the average of previous subgradients.

ADAHESSIAN: An Adaptive Second Order Optimizer for Machine Learning

ADAHESSIAN is a new stochastic optimization algorithm that directly incorporates approximate curvature information from the loss function, and it includes several novel performance-improving features, including a fast Hutchinson based method to approximate the curvature matrix with low computational overhead.
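
As a rough illustration of the Hutchinson idea mentioned above, the sketch below estimates the diagonal of a Hessian using only Hessian-vector products with random ±1 vectors. It assumes a toy quadratic f(x) = 0.5 xᵀAx, so the Hessian is simply A; in a deep learning framework the product would come from automatic differentiation instead. All names here are hypothetical and this is not the ADAHESSIAN implementation.

```python
import random

def hessian_vector_product(A, v):
    """H v for the toy quadratic f(x) = 0.5 * x^T A x, whose Hessian is A."""
    return [sum(a_ij * v_j for a_ij, v_j in zip(row, v)) for row in A]

def hutchinson_diagonal(A, num_samples=2000, seed=0):
    """Estimate diag(H) as the average of v * (H v) over Rademacher vectors v."""
    rng = random.Random(seed)
    n = len(A)
    estimate = [0.0] * n
    for _ in range(num_samples):
        # Rademacher vector: each entry is +1 or -1 with equal probability.
        v = [rng.choice((-1.0, 1.0)) for _ in range(n)]
        hv = hessian_vector_product(A, v)
        for i in range(n):
            # E[v_i * (Hv)_i] = H_ii because E[v_i * v_j] = 0 for i != j.
            estimate[i] += v[i] * hv[i] / num_samples
    return estimate
```

Each sample costs one Hessian-vector product, so the full Hessian is never formed.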

Adam: A Method for Stochastic Optimization

This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
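
The moment estimates described above can be sketched for a single scalar parameter as follows. The default hyperparameters and the toy objective (θ − 3)² are assumptions made for this example, and `adam_step` and `minimize_quadratic` are hypothetical helper names.

```python
import math

def adam_step(theta, grad, m, v, t, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t is the 1-based step count."""
    # Exponential moving averages of the gradient and the squared gradient.
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad * grad
    # Bias correction compensates for initializing m and v at zero.
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

def minimize_quadratic(steps=1000):
    """Run Adam on the toy objective f(theta) = (theta - 3)^2 from theta = 0."""
    theta, m, v = 0.0, 0.0, 0.0
    for t in range(1, steps + 1):
        grad = 2.0 * (theta - 3.0)
        theta, m, v = adam_step(theta, grad, m, v, t)
    return theta
```

On the first step the bias-corrected ratio is close to the sign of the gradient, so the parameter moves by roughly `lr` regardless of the gradient's scale.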

Deep & Cross Network for Ad Click Predictions

This paper proposes the Deep & Cross Network (DCN), which keeps the benefits of a DNN model, and beyond that, it introduces a novel cross network that is more efficient in learning certain bounded-degree feature interactions.
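
A single cross layer can be sketched as below, using the update x_{l+1} = x_0 (x_lᵀ w_l) + b_l + x_l on plain Python lists. The function name and the list-based representation are assumptions for this example; a real model would use a tensor library and learn `w` and `b`.

```python
def cross_layer(x0, xl, w, b):
    """Compute x_{l+1} = x0 * (xl . w) + b + xl for one cross layer.

    x0 is the original input, xl the previous layer's output; w and b are
    the layer's weight and bias vectors (all lists of equal length).
    """
    # Scalar projection of the current layer output onto the weight vector.
    s = sum(xl_i * w_i for xl_i, w_i in zip(xl, w))
    # Scale the original input by that scalar, then add bias and a residual.
    return [x0_i * s + b_i + xl_i for x0_i, b_i, xl_i in zip(x0, b, xl)]
```

Because every layer multiplies by `x0` once more, stacking layers raises the highest polynomial degree of the feature interactions by one per layer, while each layer adds only two vectors of parameters.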

To prune, or not to prune: exploring the efficacy of pruning for model compression

Across a broad range of neural network architectures, large-sparse models are found to consistently outperform small-dense models and achieve up to 10x reduction in number of non-zero parameters with minimal loss in accuracy.

Deep Learning Recommendation Model for Personalization and Recommendation Systems

A state-of-the-art deep learning recommendation model (DLRM) is developed, with implementations provided in both PyTorch and Caffe2. A specialized parallelization scheme is also designed: model parallelism on the embedding tables mitigates memory constraints, while data parallelism scales out compute from the fully-connected layers.

Adaptive Subgradient Methods for Online Learning and Stochastic Optimization

This work describes and analyzes an apparatus for adaptively modifying the proximal function, which significantly simplifies setting a learning rate and results in regret guarantees that are provably as good as the best proximal functions that can be chosen in hindsight.
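
The commonly used diagonal form of this idea (Adagrad) can be sketched as follows: each coordinate's step is divided by the root of its accumulated squared gradients. The function name and default values are assumptions for this example, and the sketch covers only the plain gradient update, not the proximal variants analyzed in the paper.

```python
import math

def adagrad_step(x, grad, accum, lr=0.1, eps=1e-8):
    """One diagonal Adagrad update; returns the new point and accumulator."""
    # Accumulate squared gradients separately for each coordinate.
    new_accum = [a + g * g for a, g in zip(accum, grad)]
    # Coordinates with a large gradient history take smaller steps.
    new_x = [
        xi - lr * g / (math.sqrt(a) + eps)
        for xi, g, a in zip(x, grad, new_accum)
    ]
    return new_x, new_accum
```

Rarely updated coordinates, such as the weights of infrequent sparse features, keep a small accumulator and therefore a comparatively large effective learning rate.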

Shampoo: Preconditioned Stochastic Tensor Optimization

This work describes and analyzes a new structure-aware preconditioning algorithm, called Shampoo, for stochastic optimization over tensor spaces, and establishes convergence guarantees in the stochastic convex setting.

Ad click prediction: a view from the trenches

The goal of this paper is to highlight the close relationship between theoretical advances and practical engineering in this industrial setting, and to show the depth of challenges that appear when applying traditional machine learning methods in a complex dynamic system.