# Are Loss Functions All the Same?

@article{Rosasco2004AreLF, title={Are Loss Functions All the Same?}, author={Lorenzo Rosasco and Ernesto de Vito and Andrea Caponnetto and Michele Piana and Alessandro Verri}, journal={Neural Computation}, year={2004}, volume={16}, pages={1063-1076} }

In this letter, we investigate the impact of choosing different loss functions from the viewpoint of statistical learning theory. We introduce a convexity assumption, which is met by all loss functions commonly used in the literature, and study how the bound on the estimation error changes with the loss. We also derive a general result on the minimizer of the expected risk for a convex loss function in the case of classification. The main outcome of our analysis is that for classification, the…
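For orientation, the convex losses usually considered in this setting can be written as functions of the margin $y f(x)$ with $y \in \{-1, +1\}$. The sketch below assumes the standard textbook forms of the square, hinge, and logistic losses, alongside the non-convex 0-1 loss; the abstract itself does not list which losses are covered, and the function names are illustrative.

```python
import math


def zero_one_loss(margin: float) -> float:
    """Misclassification loss: 1 when the margin y*f(x) is not positive."""
    return 1.0 if margin <= 0.0 else 0.0


def square_loss(margin: float) -> float:
    """Square loss written in margin form: (1 - y*f(x))^2."""
    return (1.0 - margin) ** 2


def hinge_loss(margin: float) -> float:
    """Hinge loss used by soft-margin SVMs: max(0, 1 - y*f(x))."""
    return max(0.0, 1.0 - margin)


def logistic_loss(margin: float) -> float:
    """Logistic loss log(1 + exp(-y*f(x))), evaluated in a numerically stable way."""
    if margin >= 0.0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    # Print each loss at a few margins to compare their shapes.
    for m in (-1.0, 0.0, 0.5, 1.0, 2.0):
        print(m, zero_one_loss(m), square_loss(m), hinge_loss(m), logistic_loss(m))
```

Each of the three surrogates is convex in the margin, which is the kind of property a convexity assumption on the loss refers to; the 0-1 loss is not.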

## 432 Citations

### From Convex to Nonconvex: A Loss Function Analysis for Binary Classification

- Computer Science
- 2010 IEEE International Conference on Data Mining Workshops
- 2010

A new differentiable non-convex loss function, called the smoothed 0-1 loss function, is proposed; it is a natural approximation of the 0-1 loss function and is robust on noisy data sets with many outliers.

### Making Convex Loss Functions Robust to Outliers using $e$-Exponentiated Transformation

- Computer Science
- ArXiv
- 2019

A novel generalization error bound shows theoretically that the transformed loss function has a tighter bound for datasets corrupted by outliers, and empirical observation shows that the accuracy obtained can be significantly better than that obtained using the original loss function and comparable to that obtained by some other state-of-the-art methods in the presence of label noise.

### On the α-loss Landscape in the Logistic Model

- Computer Science
- 2020 IEEE International Symposium on Information Theory (ISIT)
- 2020

This work studies the evolution of the optimization landscape of α-loss with respect to α using tools drawn from the study of strictly-locally-quasi-convex functions in addition to geometric techniques and interprets the results in terms of optimization complexity via normalized gradient descent.
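As a rough illustration of what varying α means, the sketch below assumes the commonly cited definition of α-loss on the probability $p$ that a model assigns to the true label: $\frac{\alpha}{\alpha-1}\left(1 - p^{1-1/\alpha}\right)$ for $\alpha \in (0,1) \cup (1,\infty)$, with log-loss at $\alpha = 1$ and $1 - p$ as $\alpha \to \infty$. This definition is not stated in the summary above, so treat it as an assumption; the function name is illustrative.

```python
import math


def alpha_loss(p_true: float, alpha: float) -> float:
    """alpha-loss of a predicted probability p_true for the correct label.

    Assumed definition (not given in the summary above):
      alpha == 1    -> log-loss, -log(p)
      alpha == inf  -> 1 - p
      otherwise     -> alpha / (alpha - 1) * (1 - p ** (1 - 1 / alpha))
    """
    if not 0.0 < p_true <= 1.0:
        raise ValueError("p_true must lie in (0, 1]")
    if alpha <= 0.0:
        raise ValueError("alpha must be positive")
    if alpha == 1.0:
        return -math.log(p_true)
    if math.isinf(alpha):
        return 1.0 - p_true
    return alpha / (alpha - 1.0) * (1.0 - p_true ** (1.0 - 1.0 / alpha))


if __name__ == "__main__":
    # The same prediction scored under several values of alpha.
    for a in (0.5, 1.0, 2.0, math.inf):
        print(a, alpha_loss(0.8, a))
```

Under this definition, $\alpha = 1/2$ gives $1/p - 1$, so small α penalizes confident mistakes more heavily than log-loss, while large α flattens the penalty.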

### High Dimensional Classification via Empirical Risk Minimization: Improvements and Optimality

- Computer Science
- ArXiv
- 2019

A family of classification algorithms defined by the principle of empirical risk minimization is investigated in the high dimensional regime where the feature dimension $p$ and data number $n$ are both large and comparable.

### A Framework of Learning Through Empirical Gain Maximization

- Computer Science
- Neural Computation
- 2021

A framework of empirical gain maximization (EGM) is developed to address the robust regression problem where heavy-tailed noise or outliers may be present in the response variable; within this framework, Tukey's biweight loss can be derived from the triweight kernel.
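The link between the triweight kernel and Tukey's biweight loss can be checked numerically. The sketch below assumes the standard forms $K(u) = \frac{35}{32}(1-u^2)^3$ on $|u| \le 1$ and $\rho_c(r) = \frac{c^2}{6}\left[1 - (1 - (r/c)^2)^3\right]$ for $|r| \le c$ (constant $c^2/6$ beyond), so that $\rho_c(r) = \frac{c^2}{6}\left(1 - \frac{32}{35} K(r/c)\right)$. The scaling constants and the default $c = 4.685$ are conventional choices, not taken from the paper, and this is not necessarily the paper's own construction.

```python
def triweight_kernel(u: float) -> float:
    """Triweight kernel: (35/32) * (1 - u^2)^3 on [-1, 1], zero elsewhere."""
    if abs(u) > 1.0:
        return 0.0
    return (35.0 / 32.0) * (1.0 - u * u) ** 3


def tukey_biweight_loss(r: float, c: float = 4.685) -> float:
    """Tukey's biweight loss of a residual r with tuning constant c (assumed default)."""
    u = r / c
    if abs(u) >= 1.0:
        return c * c / 6.0  # bounded: large residuals all cost the same
    return (c * c / 6.0) * (1.0 - (1.0 - u * u) ** 3)


def loss_from_kernel(r: float, c: float = 4.685) -> float:
    """The same loss expressed as a constant minus a rescaled triweight 'gain'."""
    return (c * c / 6.0) * (1.0 - (32.0 / 35.0) * triweight_kernel(r / c))


if __name__ == "__main__":
    for r in (0.0, 1.0, 3.0, 4.685, 10.0):
        print(r, tukey_biweight_loss(r), loss_from_kernel(r))
```

Because the loss equals a constant minus a positive multiple of the kernel, minimizing the summed loss over residuals is equivalent to maximizing the summed kernel value, which is the sense in which a gain-maximization view can recover this loss.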

### ROC Curves, Loss Functions, and Distorted Probabilities in Binary Classification

- Computer Science
- Mathematics
- 2022

This work shows that different measures of accuracy, such as the area under the ROC curve, the maximal balanced accuracy, and the maximally weighted accuracy, are topologically equivalent, with natural inequalities relating them.

### Cost-sensitive Multiclass Classification Risk Bounds

- Computer Science
- ICML
- 2013

A bound is developed for the case of cost-sensitive multiclass classification with a convex surrogate loss that goes back to the work of Lee, Lin and Wahba; the bound is as easy to calculate as in binary classification.

### A new loss function for robust classification

- Computer Science
- Intell. Data Anal.
- 2014

The experimental results show that the proposed smoothed 0-1 loss function works better than several existing loss functions on data sets with noisy labels, noisy features, and outliers.

### High Dimensional Classification via Regularized and Unregularized Empirical Risk Minimization: Precise Error and Optimal Loss

- Computer Science
- 2019

This article provides an in-depth understanding of the classification performance of the empirical risk minimization framework, in both ridge-regularized and unregularized cases, when high dimensional data are considered, and identifies the simple square loss as the optimal choice for high dimensional classification, regardless of the number of training samples.

### Stochastic Gradient Descent with Exponential Convergence Rates of Expected Classification Errors

- Computer Science, Mathematics
- AISTATS
- 2019

An exponential convergence of the expected classification error in the final phase of the stochastic gradient descent for a wide class of differentiable convex loss functions under similar assumptions is shown.

## References

### Statistical behavior and consistency of classification methods based on convex risk minimization

- Computer Science
- 2003

This study sheds light on the good performance of some recently proposed linear classification methods, including boosting and support vector machines, shows their limitations, and suggests possible improvements.

### Statistical Properties and Adaptive Tuning of Support Vector Machines

- Computer Science
- Machine Learning
- 2004

An approach to adaptively tuning the smoothing parameter(s) in SVMs is described, based on the generalized approximate cross validation (GACV), which is an easily computable proxy of the generalized comparative Kullback-Leibler distance (GCKL).

### Scale-sensitive dimensions, uniform convergence, and learnability

- Mathematics
- Proceedings of 1993 IEEE 34th Annual Foundations of Computer Science
- 1993

This work gives a characterization of learnability in the probabilistic concept model, solving an open problem posed by Kearns and Schapire, and shows that the accuracy parameter plays a crucial role in determining the effective complexity of the learner's hypothesis class.

### The covering number in learning theory

- Mathematics
- J. Complex.
- 2002

This work gives estimates for the covering number of a ball of a reproducing kernel Hilbert space as a subset of the continuous function space by means of the regularity of the Mercer kernel $K$, and provides an example of a Mercer kernel to show that $L_K^{1/2}$ may not be generated by a Mercer kernel.

### On the mathematical foundations of learning

- Computer Science
- 2001

A main theme of this report is the relationship of approximation to learning and the primary role of sampling (inductive inference). We try to emphasize relations of the theory of learning to the…

### Regularization Theory and Neural Networks Architectures

- Computer Science, Mathematics
- Neural Computation
- 1995

This paper shows that regularization networks encompass a much broader range of approximation schemes, including many of the popular general additive models and some of the neural networks, and introduces new classes of smoothness functionals that lead to different classes of basis functions.

### On the Bayes-risk consistency of regularized boosting methods

- Computer Science
- 2003

The main result of the paper is that certain regularized boosting algorithms provide Bayes-risk consistent classifiers under the sole assumption that the Bayes classifier may be approximated by a convex combination of the base classifiers.

### The Elements of Statistical Learning

- Business
- Technometrics
- 2003

Chapter 11 includes more case studies in other areas, ranging from manufacturing to marketing research, and a detailed comparison with other diagnostic tools, such as logistic regression and tree-based methods.

### SVM Soft Margin Classifiers: Linear Programming versus Quadratic Programming

- Computer Science
- Neural Computation
- 2005

This article shows that the convergence behavior of the linear programming SVM is almost the same as that of the quadratic programming SVM, and proposes an upper bound for the misclassification error for general probability distributions.