# Are Loss Functions All the Same?

@article{Rosasco2004AreLF,
title={Are Loss Functions All the Same?},
author={Lorenzo Rosasco and Ernesto de Vito and Andrea Caponnetto and Michele Piana and Alessandro Verri},
journal={Neural Computation},
year={2004},
volume={16},
pages={1063-1076}
}
• L. Rosasco, +2 authors A. Verri
• Published 2004
• Medicine, Computer Science, Mathematics
• Neural Computation
In this letter, we investigate the impact of choosing different loss functions from the viewpoint of statistical learning theory. We introduce a convexity assumption, which is met by all loss functions commonly used in the literature, and study how the bound on the estimation error changes with the loss. We also derive a general result on the minimizer of the expected risk for a convex loss function in the case of classification. The main outcome of our analysis is that for classification, the… Expand
348 Citations

#### Topics from this paper

Making Convex Loss Functions Robust to Outliers using $e$-Exponentiated Transformation
A novel generalization error bound is theoretically shown that the transformed loss function has a tighter bound for datasets corrupted by outliers and the empirical observation shows that the accuracy obtained can be significantly better than the same obtained using the original loss function and comparable to that obtained by some other state of the art methods in the presence of label noise. Expand
From Convex to Nonconvex: A Loss Function Analysis for Binary Classification
• Mathematics, Computer Science
• 2010 IEEE International Conference on Data Mining Workshops
• 2010
A new differentiable non-convex loss function is proposed, called smoothed 0-1 loss function, which is a natural approximation of the 0-2 loss function and is robust for those noisy data sets with many outliers. Expand
On the α-loss Landscape in the Logistic Model
• Computer Science, Mathematics
• 2020 IEEE International Symposium on Information Theory (ISIT)
• 2020
This work studies the evolution of the optimization landscape of α-loss with respect to α using tools drawn from the study of strictly-locally-quasi-convex functions in addition to geometric techniques and interprets the results in terms of optimization complexity via normalized gradient descent. Expand
High Dimensional Classification via Empirical Risk Minimization: Improvements and Optimality
• Mathematics, Computer Science
• ArXiv
• 2019
A family of classification algorithms defined by the principle of empirical risk minimization, in the high dimensional regime where the feature dimension $p$ and data number $n$ are both large and comparable are investigated. Expand
Cost-sensitive Multiclass Classification Risk Bounds
• Mathematics, Computer Science
• ICML
• 2013
A bound is developed for the case of cost-sensitive multiclass classification and a convex surrogate loss that goes back to the work of Lee, Lin and Wahba and is as easy to calculate as in binary classification. Expand
A new loss function for robust classification
• Mathematics, Computer Science
• Intell. Data Anal.
• 2014
The experimental results show that the proposed smoothed 0-1 loss function works better on data sets with noisy labels, noisy features, and outliers than several existing loss functions in the classification of noisy data sets. Expand
Stochastic Gradient Descent with Exponential Convergence Rates of Expected Classification Errors
• Computer Science, Mathematics
• AISTATS
• 2019
An exponential convergence of the expected classification error in the final phase of the stochastic gradient descent for a wide class of differentiable convex loss functions under similar assumptions is shown. Expand
On the alpha-loss Landscape in the Logistic Model
• Computer Science, Mathematics
• ArXiv
• 2020
This work analyzes the evolution of the optimization landscape of $\alpha$-loss with respect to $alpha$ using tools drawn from the study of strictly-locally-quasi-convex functions in addition to geometric techniques and interprets these results in terms of optimization complexity via normalized gradient descent. Expand
Learning with Non-Convex Truncated Losses by SGD
• Computer Science, Mathematics
• UAI
• 2019
A family of objective functions formed by truncating traditional loss functions, applicable to both shallow learning and deep learning are studied, which establish excess risk bounds of empirical risk minimization based on truncated losses for heavy-tailed output, and statistical error of an approximate stationary point found by stochastic gradient descent (SGD) method. Expand
Model selection and error estimation without the agonizing pain
• L. Oneto
• Computer Science
• Wiley Interdiscip. Rev. Data Min. Knowl. Discov.
• 2018
The purpose of this review is to give an intelligible overview of the problems of model selection and error estimation, by focusing on the ideas behind the different SLT‐based approaches and simplifying most of the technical aspects with the purpose of making them more accessible and usable in practice. Expand

#### References

SHOWING 1-10 OF 26 REFERENCES
Statistical behavior and consistency of classification methods based on convex risk minimization
We study how closely the optimal Bayes error rate can be approximately reached using a classification algorithm that computes a classifier by minimizing a convex upper bound of the classificationExpand
Statistical Properties and Adaptive Tuning of Support Vector Machines
• Mathematics, Computer Science
• Machine Learning
• 2004
An approach to adaptively tuning the smoothing parameter(s) in the SVMs is described, based on the generalized approximate cross validation (GACV), which is an easily computable proxy of the GCKL. Expand
Scale-sensitive dimensions, uniform convergence, and learnability
• Mathematics, Computer Science
• JACM
• 1997
A characterization of learnability in the probabilistic concept model, solving an open problem posed by Kearns and Schapire, and shows that the accuracy parameter plays a crucial role in determining the effective complexity of the learner's hypothesis class. Expand
The covering number in learning theory
This work gives estimates for the covering number of a ball of a reproducing kernel Hilbert space as a subset of the continuous function space by means of the regularity of the Mercer kernel K, and provides an example of a Mercer kernels to show that LK½ may not be generated by a Mercer kernel. Expand
On the mathematical foundations of learning
• Computer Science
• 2001
(1) A main theme of this report is the relationship of approximation to learning and the primary role of sampling (inductive inference). We try to emphasize relations of the theory of learning to theExpand
A note on different covering numbers in learning theory
• M. Pontil
• Computer Science, Mathematics
• J. Complex.
• 2003
It is formally shown that when F is made of smooth functions, the discrete covering number is close to its continuous counterpart, and illustrated in the case that F is a ball in a reproducing kernel Hilbert space. Expand
Regularization Theory and Neural Networks Architectures
• Computer Science, Mathematics
• Neural Computation
• 1995
This paper shows that regularization networks encompass a much broader range of approximation schemes, including many of the popular general additive models and some of the neural networks, and introduces new classes of smoothness functionals that lead to different classes of basis functions. Expand
On the Bayes-risk consistency of regularized boosting methods
• Mathematics
• 2003
The probability of error of classification methods based on convex combinations of simple base classifiers by boosting algorithms is investigated. The main result of the paper is that certainExpand
The Elements of Statistical Learning
• E. Ziegel
• Computer Science, Mathematics
• Technometrics
• 2003
Chapter 11 includes more case studies in other areas, ranging from manufacturing to marketing research, and a detailed comparison with other diagnostic tools, such as logistic regression and tree-based methods. Expand
SVM Soft Margin Classifiers: Linear Programming versus Quadratic Programming
• Mathematics, Computer Science
• Neural Computation
• 2005
This article shows that the convergence behavior of the linear programming SVM is almost the same as that of the quadratic programming S VM, and proposes an upper bound for the misclassification error for general probability distributions. Expand