Corpus ID: 212717720

Can Implicit Bias Explain Generalization? Stochastic Convex Optimization as a Case Study

Assaf Dauber, Meir Feder, Tomer Koren, Roi Livni
The notion of implicit bias, or implicit regularization, has been suggested as a means to explain the surprising generalization ability of modern-day overparameterized learning algorithms. This notion refers to the tendency of the optimization algorithm towards a certain structured solution that often generalizes well. Recently, several papers have studied implicit regularization and were able to identify this phenomenon in various scenarios. We revisit this paradigm in arguably the simplest… 

Implicit Regularization in Deep Learning May Not Be Explainable by Norms

The results suggest that, rather than perceiving the implicit regularization via norms, a potentially more useful interpretation is minimization of rank. It is demonstrated empirically that this interpretation extends to a certain class of non-linear neural networks, and it is hypothesized that it may be key to explaining generalization in deep learning.

Implicit Regularization in ReLU Networks with the Square Loss

It is proved that even for a single ReLU neuron, it is impossible to characterize the implicit regularization with the square loss by any explicit function of the model parameters, and a more general framework than the one considered so far may be needed to understand implicit regularizations for nonlinear predictors.

SGD Generalizes Better Than GD (And Regularization Doesn't Help)

It is shown that with the same number of steps GD may overfit and emit a solution with Ω(1) generalization error, and that regularizing the empirical risk minimized by GD essentially does not change this result.

Is SGD a Bayesian sampler? Well, almost

Estimating the probability that an overparameterised DNN, trained with stochastic gradient descent or one of its variants, converges on a function consistent with a training set implies that strong inductive bias in the parameter-function map, rather than a special property of SGD, is the primary explanation for why DNNs generalise so well in the overparameterised regime.

A Limitation of the PAC-Bayes Framework

An easy learning task that is not amenable to a PAC-Bayes analysis is demonstrated, and it is shown that for any algorithm that learns 1-dimensional linear classifiers there exists a (realizable) distribution for which the PAC-Bayes bound is arbitrarily large.

On Convergence and Generalization of Dropout Training

It is shown that dropout training with logistic loss achieves $\epsilon$-suboptimality in test error in $O(1/\epsilon)$ iterations.

Stochastic Training is Not Necessary for Generalization

It is demonstrated that non-stochastic full-batch training can achieve strong performance on CIFAR-10 that is on-par with SGD, using modern architectures in settings with and without data augmentation.

SGD: The Role of Implicit Regularization, Batch-size and Multiple-epochs

This paper considers the problem of SCO and explores the role of implicit regularization, batch size and multiple epochs for SGD. It extends the results to the general learning setting by showing a problem that is learnable for any data distribution and on which SGD is strictly better than RERM for any regularization function.

Benign Underfitting of SGD in Stochastic Convex Optimization

It turns out that SGD is not algorithmically stable in any sense, and its generalization ability cannot be explained by uniform convergence or any other currently known generalization bound technique for that matter (other than that of its classical analysis).

Benign Underfitting of Stochastic Gradient Descent

It turns out that SGD is not algorithmically stable in any sense, and its generalization ability cannot be explained by uniform convergence or any other currently known generalization bound technique for that matter (other than that of its classical analysis).



Uniform convergence may be unable to explain generalization in deep learning

Through numerous experiments, doubt is cast on the power of uniform convergence-based generalization bounds to provide a complete picture of why overparameterized deep networks generalize well.

Stochastic Convex Optimization

Stochastic convex optimization is studied, and it is shown that the key ingredient is strong convexity and regularization, which is only a sufficient, but not necessary, condition for meaningful non-trivial learnability.

Train faster, generalize better: Stability of stochastic gradient descent

We show that parametric models trained by a stochastic gradient method (SGM) with few iterations have vanishing generalization error. We prove our results by arguing that SGM is algorithmically

Implicit Regularization in Deep Matrix Factorization

This work studies the implicit regularization of gradient descent over deep linear neural networks for matrix completion and sensing, a model referred to as deep matrix factorization, and finds that adding depth to a matrix factorization enhances an implicit tendency towards low-rank solutions.

Stochastic Gradient Descent for Non-smooth Optimization: Convergence Results and Optimal Averaging Schemes

The performance of SGD without non-trivial smoothness assumptions is investigated, as well as a running average scheme to convert the SGD iterates to a solution with optimal optimization accuracy, and a new and simple averaging scheme is proposed which not only attains optimal rates, but can also be easily computed on-the-fly.

Connecting Optimization and Regularization Paths

This work studies the implicit regularization properties of optimization techniques by explicitly connecting their optimization paths to the regularization paths of "corresponding" regularized problems, and investigates one key consequence that borrows from the well-studied analysis of regularized estimators to obtain tight excess risk bounds of the iterates generated by optimization techniques.

The Implicit Bias of Gradient Descent on Separable Data

We examine gradient descent on unregularized logistic regression problems, with homogeneous linear predictors on linearly separable datasets. We show the predictor converges to the direction of the

Early Stopping for Kernel Boosting Algorithms: A General Analysis With Localized Complexities

This paper exhibits a direct connection between the performance of a stopped iterate and the localized Gaussian complexity of the associated function class, and shows that the local fixed point analysis of Gaussian or Rademacher complexities can be used to derive optimal stopping rules.

Implicit Regularization in Deep Learning

It is shown that implicit regularization induced by the optimization method is playing a key role in generalization and success of deep learning models, and how different complexity measures can ensure generalization is studied to explain different observed phenomena in deep learning.

Understanding deep learning requires rethinking generalization

These experiments establish that state-of-the-art convolutional networks for image classification trained with stochastic gradient methods easily fit a random labeling of the training data, and confirm that simple depth two neural networks already have perfect finite sample expressivity.