Corpus ID: 57189215

Reconciling modern machine learning and the bias-variance trade-off

@article{Belkin2018ReconcilingMM,
  title={Reconciling modern machine learning and the bias-variance trade-off},
  author={Mikhail Belkin and Daniel J. Hsu and Siyuan Ma and Soumik Mandal},
  journal={ArXiv},
  year={2018},
  volume={abs/1812.11118}
}
The question of generalization in machine learning---how algorithms are able to learn predictors from a training sample to make accurate predictions out-of-sample---is revisited in light of the recent breakthroughs in modern machine learning technology. The classical approach to understanding generalization is based on bias-variance trade-offs, where model complexity is carefully calibrated so that the fit on the training sample reflects performance out-of-sample. However, it is now common… 
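As a rough illustration of the double-descent behavior the abstract alludes to, the sketch below (my own example, not the authors' code; every name and parameter choice is an assumption) fits minimum-norm least squares on random ReLU features and sweeps the number of features past the interpolation threshold, where test error typically peaks before falling again.

```python
# Minimal double-descent sketch (illustrative only): minimum-norm least squares
# on random ReLU features, sweeping model width past the interpolation threshold.
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, d = 40, 500, 5

def make_data(n):
    X = rng.normal(size=(n, d))
    y = np.sin(X[:, 0]) + 0.3 * rng.normal(size=n)   # noisy scalar target
    return X, y

X_tr, y_tr = make_data(n_train)
X_te, y_te = make_data(n_test)

def relu_features(X, W):
    return np.maximum(X @ W, 0.0)                     # random ReLU feature map

for width in [5, 10, 20, 40, 80, 160, 640]:           # width = 40 is the interpolation threshold
    W = rng.normal(size=(d, width)) / np.sqrt(d)
    Phi_tr, Phi_te = relu_features(X_tr, W), relu_features(X_te, W)
    beta = np.linalg.pinv(Phi_tr) @ y_tr              # minimum-norm least-squares fit
    train_mse = np.mean((Phi_tr @ beta - y_tr) ** 2)
    test_mse = np.mean((Phi_te @ beta - y_te) ** 2)
    print(f"width={width:4d}  train MSE={train_mse:.3f}  test MSE={test_mse:.3f}")
```

Training error drops to (near) zero once the width reaches the sample size, while test error tends to peak there and then improve as the model is overparameterized further.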
Citations

Benign interpolation of noise in deep learning
TLDR
The findings suggest that the notion of model capacity needs to be modified to account for the distributed way in which training data are fitted across sub-units, and it is shown that models tend to fit uncorrupted samples first.
Double Trouble in Double Descent: Bias and Variance(s) in the Lazy Regime
TLDR
A quantitative theory for the double descent of test error in the so-called lazy learning regime of neural networks is developed by considering the problem of learning a high-dimensional function with random features regression, and it is shown that the bias displays a phase transition at the interpolation threshold, beyond which it remains constant.
Risk Monotonicity in Statistical Learning
TLDR
This paper derives the first consistent and risk-monotonic algorithms for a general statistical learning setting under weak assumptions, consequently answering some questions posed by [53] on how to avoid non-monotonic behavior of risk curves, and shows that risk monotonicity need not come at the price of worse excess risk rates.
Benefit of Interpolation in Nearest Neighbor Algorithms
TLDR
This work considers a class of interpolated weighting schemes, reveals a U-shaped performance curve, and proves that a mild degree of data interpolation improves the prediction accuracy and statistical stability over those of the (un-interpolated) optimal k-NN algorithm.
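The kind of interpolating weighting scheme summarized above can be sketched as follows; this is my own illustration (the exponent gamma and the eps guard are assumptions, not the paper's exact construction). The weights blow up as the query approaches a training point, so the rule interpolates the training labels while still averaging over k neighbors elsewhere.

```python
# Sketch of an interpolating weighted k-NN regressor (illustrative only).
import numpy as np

def interpolating_knn_predict(X_train, y_train, X_query, k=5, gamma=2.0, eps=1e-12):
    preds = []
    for x in X_query:
        dists = np.linalg.norm(X_train - x, axis=1)
        idx = np.argsort(dists)[:k]                   # indices of the k nearest neighbors
        w = 1.0 / (dists[idx] ** gamma + eps)         # singular weights: w -> inf as dist -> 0
        preds.append(np.dot(w, y_train[idx]) / w.sum())
    return np.array(preds)

rng = np.random.default_rng(1)
X = rng.uniform(-1.0, 1.0, size=(200, 1))
y = np.sin(3 * X[:, 0]) + 0.2 * rng.normal(size=200)
print(interpolating_knn_predict(X, y, X[:3]))          # ~ reproduces the training labels
print(y[:3])
```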
Predictive Model Degrees of Freedom in Linear Regression
TLDR
This work proposes a measure with a proper adjustment based on the squared covariance between the predictions and observations which can reconcile the “double descent” phenomenon with the classical theory and opens doors to an extended definition of model degrees of freedom in modern predictive settings.
On the interplay between data structure and loss function in classification problems
TLDR
This work considers an analytically tractable model of structured data, where the input covariance is built from independent blocks, allowing the saliency of low-dimensional structures and their alignment with the target function to be tuned.
Harmless Interpolation of Noisy Data in Regression
TLDR
It is shown that the fundamental generalization (mean-squared) error of any interpolating solution in the presence of noise decays to zero with the number of features, and that overparameterization can be beneficial in ensuring harmless interpolation of noise.
Understanding overfitting peaks in generalization error: Analytical risk curves for l2 and l1 penalized interpolation
TLDR
A generative and fitting model pair is introduced, and it is shown that the overfitting peak can be dissociated from the point at which the fitting function gains enough degrees of freedom to match the data-generating model and thus provide good generalization.
A Modern Take on the Bias-Variance Tradeoff in Neural Networks
TLDR
It is found that both bias and variance can decrease as the number of parameters grows, and a new decomposition of the variance is introduced to disentangle the effects of optimization and data sampling.
Benign overfitting in linear regression
TLDR
A characterization of linear regression problems for which the minimum norm interpolating prediction rule has near-optimal prediction accuracy shows that overparameterization is essential for benign overfitting in this setting: the number of directions in parameter space that are unimportant for prediction must significantly exceed the sample size.

References

SHOWING 1-10 OF 23 REFERENCES
To understand deep learning we need to understand kernel learning
TLDR
It is argued that progress on understanding deep learning will be difficult until more tractable "shallow" kernel methods are better understood, and that new theoretical ideas are needed to understand the properties of classical kernel methods.
Overfitting or perfect fitting? Risk bounds for classification and regression rules that interpolate
TLDR
A step is taken toward a theoretical foundation for interpolated classifiers by analyzing local interpolating schemes, including a geometric simplicial interpolation algorithm and singularly weighted k-nearest-neighbor schemes, and consistency or near-consistency is proved for these schemes in classification and regression problems.
Theoretical Insights Into the Optimization Landscape of Over-Parameterized Shallow Neural Networks
TLDR
It is shown that, with quadratic activations, the optimization landscape of training such shallow neural networks has certain favorable characteristics that allow globally optimal models to be found efficiently using a variety of local search heuristics.
Gaussian Processes for Machine Learning
TLDR
The treatment is comprehensive and self-contained, targeted at researchers and students in machine learning and applied statistics, and deals with the supervised learning problem for both regression and classification.
Boosting the margin: A new explanation for the effectiveness of voting methods
TLDR
It is shown that techniques used in the analysis of Vapnik's support vector classifiers and of neural networks with small weights can be applied to voting methods to relate the margin distribution to the test error.
An Analysis of Deep Neural Network Models for Practical Applications
TLDR
This work presents a comprehensive analysis of important metrics in practical applications: accuracy, memory footprint, parameters, operations count, inference time, and power consumption, and argues that it provides a compelling set of information to help design and engineer efficient DNNs.
Explaining the Success of AdaBoost and Random Forests as Interpolating Classifiers
TLDR
A novel perspective on AdaBoost and random forests is introduced, proposing that the two algorithms work for similar reasons and concluding that boosting should be used like random forests: with large decision trees and without direct regularization or early stopping.
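As a quick, hedged illustration of the interpolating-classifier view (my own example, not the paper's experiments), a random forest with fully grown trees typically drives training error to zero yet still generalizes on held-out data:

```python
# Random forest as an interpolating classifier (illustrative sketch).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, flip_y=0.05, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# max_depth=None grows each tree until the leaves are pure (no direct regularization).
clf = RandomForestClassifier(n_estimators=200, max_depth=None, random_state=0).fit(X_tr, y_tr)
print("train accuracy:", clf.score(X_tr, y_tr))       # typically 1.0, i.e. interpolation
print("test accuracy:", clf.score(X_te, y_te))
```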
Generalization Properties of Learning with Random Features
TLDR
The results shed light on the statistical-computational trade-offs in large-scale kernelized learning, showing the potential effectiveness of random features in reducing the computational complexity while keeping optimal generalization properties.
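As a rough sketch of the random-features idea (a generic random-Fourier-features construction, not necessarily this paper's exact setup; parameter names are my own), the snippet below approximates an RBF kernel value with a random cosine feature map:

```python
# Random Fourier features approximating an RBF kernel (illustrative only).
import numpy as np

rng = np.random.default_rng(2)
d, n_features, gamma = 3, 2000, 0.5                   # gamma: assumed RBF bandwidth parameter

W = rng.normal(scale=np.sqrt(2 * gamma), size=(d, n_features))
b = rng.uniform(0.0, 2 * np.pi, size=n_features)

def rff(X):
    # Map rows of X to features so that phi(x) . phi(z) approximates K(x, z).
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

x, z = rng.normal(size=d), rng.normal(size=d)
exact = np.exp(-gamma * np.sum((x - z) ** 2))          # exact RBF kernel value
approx = rff(x[None, :])[0] @ rff(z[None, :])[0]       # feature-space inner product
print(f"exact kernel: {exact:.4f}   RFF approximation: {approx:.4f}")
```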
Greedy function approximation: A gradient boosting machine.
TLDR
A general gradient descent boosting paradigm is developed for additive expansions based on any fitting criterion, and specific algorithms are presented for least-squares, least absolute deviation, and Huber-M loss functions for regression, and multiclass logistic likelihood for classification.
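The residual-fitting view of least-squares gradient boosting summarized above can be sketched in a few lines; this is a simplified illustration using scikit-learn regression trees as base learners, not Friedman's original implementation:

```python
# Least-squares gradient boosting sketch: each stage fits a small tree to the
# current residuals (the negative gradient of the squared loss). Illustrative only.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(3)
X = rng.uniform(-3.0, 3.0, size=(300, 1))
y = np.sin(X[:, 0]) + 0.2 * rng.normal(size=300)

learning_rate, n_stages = 0.1, 100
pred = np.full_like(y, y.mean())                       # F_0: best constant fit
trees = []
for _ in range(n_stages):
    residual = y - pred                                # negative gradient of squared loss
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    pred += learning_rate * tree.predict(X)
    trees.append(tree)

print("training MSE after boosting:", np.mean((y - pred) ** 2))
```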
The Sample Complexity of Pattern Classification with Neural Networks: The Size of the Weights is More Important than the Size of the Network
TLDR
Results in this paper show that if a large neural network is used for a pattern classification problem and the learning algorithm finds a network with small weights that has small squared error on the training patterns, then the generalization performance depends on the size of the weights rather than the number of weights.