Are Loss Functions All the Same?

Lorenzo Rosasco, Ernesto de Vito, Andrea Caponnetto, Michele Piana, Alessandro Verri. Neural Computation.
In this letter, we investigate the impact of choosing different loss functions from the viewpoint of statistical learning theory. We introduce a convexity assumption, which is met by all loss functions commonly used in the literature, and study how the bound on the estimation error changes with the loss. We also derive a general result on the minimizer of the expected risk for a convex loss function in the case of classification. The main outcome of our analysis is that for classification, the… 
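The losses compared in this line of work have simple closed forms as functions of the margin m = y·f(x); a minimal sketch of the standard convex examples (the selection and names are ours, not the letter's):

```python
import math

# Standard convex margin-based losses L(m), with m = y * f(x).
def hinge(m):
    return max(0.0, 1.0 - m)             # support vector machines

def logistic(m):
    return math.log(1.0 + math.exp(-m))  # logistic regression

def square(m):
    return (1.0 - m) ** 2                # regularized least squares

def exponential(m):
    return math.exp(-m)                  # boosting

# Each is convex in m and upper-bounds the 0-1 loss (up to scaling).
for m in (-1.0, 0.0, 1.0, 2.0):
    print(m, hinge(m), round(logistic(m), 3), square(m), round(exponential(m), 3))
```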

Making Convex Loss Functions Robust to Outliers using $e$-Exponentiated Transformation

A generalization error bound is derived showing that the transformed loss function has a tighter bound for datasets corrupted by outliers, and empirical observations show that the resulting accuracy can be significantly better than that obtained with the original loss function and comparable to that of other state-of-the-art methods in the presence of label noise.

On the α-loss Landscape in the Logistic Model

This work studies the evolution of the optimization landscape of α-loss with respect to α using tools drawn from the study of strictly-locally-quasi-convex functions in addition to geometric techniques and interprets the results in terms of optimization complexity via normalized gradient descent.
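In the α-loss literature, the loss assigned to the probability p of the true label is commonly written as (α/(α−1))(1 − p^(1−1/α)), recovering log loss as α → 1; a sketch under that assumption (not code from this paper):

```python
import math

def alpha_loss(p, alpha):
    """alpha-loss of the probability p assigned to the true label.
    Recovers log loss in the limit alpha -> 1 (handled as a special case)."""
    if alpha == 1.0:
        return -math.log(p)
    return (alpha / (alpha - 1.0)) * (1.0 - p ** (1.0 - 1.0 / alpha))

# Near alpha = 1 the loss is close to -log(p).
p = 0.7
print(alpha_loss(p, 1.0), alpha_loss(p, 1.0001))
```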

High Dimensional Classification via Empirical Risk Minimization: Improvements and Optimality

A family of classification algorithms defined by the principle of empirical risk minimization is investigated in the high-dimensional regime where the feature dimension $p$ and the number of samples $n$ are both large and comparable.

A Framework of Learning Through Empirical Gain Maximization

A framework of empirical gain maximization (EGM) is developed to address the robust regression problem, where heavy-tailed noise or outliers may be present in the response variable, and it is shown that Tukey's biweight loss can be derived from the triweight kernel.
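Tukey's biweight (bisquare) loss mentioned here has a standard closed form; a minimal sketch, with the conventional tuning constant c = 4.685 (a textbook default for 95% Gaussian efficiency, not a value from the paper):

```python
def tukey_biweight(r, c=4.685):
    """Tukey's biweight (bisquare) loss: roughly quadratic near zero,
    constant beyond the threshold c, so large residuals (outliers)
    stop influencing the fit."""
    if abs(r) <= c:
        return (c ** 2 / 6.0) * (1.0 - (1.0 - (r / c) ** 2) ** 3)
    return c ** 2 / 6.0

# Residuals beyond c all incur the same bounded penalty.
print(tukey_biweight(0.0), tukey_biweight(10.0), tukey_biweight(100.0))
```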

ROC Curves, Loss Functions, and Distorted Probabilities in Binary Classification

This work shows that different measures of accuracy, such as the area under the ROC curve, the maximal balanced accuracy, and the maximal weighted accuracy, are topologically equivalent, with natural inequalities relating them.
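The accuracy measures being related here can be computed directly from classifier scores; a minimal sketch (function names and data are ours, for illustration only):

```python
def auc(scores_pos, scores_neg):
    """Area under the ROC curve: the probability that a random positive
    example scores above a random negative one (ties count half)."""
    wins = sum((p > n) + 0.5 * (p == n)
               for p in scores_pos for n in scores_neg)
    return wins / (len(scores_pos) * len(scores_neg))

def balanced_accuracy(scores_pos, scores_neg, threshold):
    """Average of the true-positive and true-negative rates at a threshold."""
    tpr = sum(s > threshold for s in scores_pos) / len(scores_pos)
    tnr = sum(s <= threshold for s in scores_neg) / len(scores_neg)
    return 0.5 * (tpr + tnr)

pos, neg = [0.9, 0.8, 0.4], [0.5, 0.3, 0.2]
print(auc(pos, neg), balanced_accuracy(pos, neg, 0.45))
```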

Cost-sensitive Multiclass Classification Risk Bounds

A risk bound is developed for cost-sensitive multiclass classification, together with a convex surrogate loss that goes back to the work of Lee, Lin, and Wahba and is as easy to compute as in binary classification.

A new loss function for robust classification

Experimental results show that the proposed smoothed 0-1 loss function outperforms several existing loss functions when classifying data sets with noisy labels, noisy features, and outliers.
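The paper's exact construction is not reproduced here; a common way to smooth the 0-1 loss is with a sigmoid in the margin, which sharpens toward the 0-1 loss as the slope k grows (an illustrative sketch, not the proposed loss):

```python
import math

def zero_one(m):
    """0-1 loss on the margin m = y * f(x)."""
    return 0.0 if m > 0 else 1.0

def smoothed_zero_one(m, k=5.0):
    """A sigmoid smoothing of the 0-1 loss (illustrative only):
    approaches zero_one as the slope k grows."""
    return 1.0 / (1.0 + math.exp(k * m))

for m in (-2.0, -0.1, 0.1, 2.0):
    print(m, zero_one(m), round(smoothed_zero_one(m), 3))
```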

High Dimensional Classification via Regularized and Unregularized Empirical Risk Minimization: Precise Error and Optimal Loss

This article provides an in-depth understanding of the classification performance of the empirical risk minimization framework, in both ridge-regularized and unregularized cases, when high dimensional data are considered, and identifies the simple square loss as the optimal choice for high dimensional classification, regardless of the number of training samples.
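The square-loss classifier singled out here amounts to ridge-regularized least squares on ±1 labels; a minimal numpy sketch on synthetic data (dimensions, seed, and penalty are our choices, not the article's):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 50
X = rng.standard_normal((n, p))
w_true = rng.standard_normal(p) / np.sqrt(p)
y = np.sign(X @ w_true)              # +/-1 labels

lam = 0.1                            # ridge penalty
# Square-loss ERM: w = argmin ||X w - y||^2 / n + lam ||w||^2
w = np.linalg.solve(X.T @ X / n + lam * np.eye(p), X.T @ y / n)
acc = np.mean(np.sign(X @ w) == y)   # training accuracy
print(acc)
```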

Stochastic Gradient Descent with Exponential Convergence Rates of Expected Classification Errors

Exponential convergence of the expected classification error in the final phase of stochastic gradient descent is shown for a wide class of differentiable convex loss functions under similar assumptions.

Statistical behavior and consistency of classification methods based on convex risk minimization

This study sheds light on the good performance of some recently proposed linear classification methods, including boosting and support vector machines, shows their limitations, and suggests possible improvements.

Statistical Properties and Adaptive Tuning of Support Vector Machines

An approach to adaptively tuning the smoothing parameter(s) in SVMs is described, based on the generalized approximate cross-validation (GACV), an easily computable proxy for the generalized comparative Kullback-Leibler distance (GCKL).

The covering number in learning theory

This work gives estimates for the covering number of a ball of a reproducing kernel Hilbert space, as a subset of the space of continuous functions, in terms of the regularity of the Mercer kernel $K$, and provides an example of a Mercer kernel showing that $L_K^{1/2}$ may not be generated by a Mercer kernel.

On the mathematical foundations of learning

A main theme of this report is the relationship of approximation to learning and the primary role of sampling (inductive inference). We try to emphasize relations of the theory of learning to the…

A note on different covering numbers in learning theory

Regularization Theory and Neural Networks Architectures

This paper shows that regularization networks encompass a much broader range of approximation schemes, including many of the popular general additive models and some of the neural networks, and introduces new classes of smoothness functionals that lead to different classes of basis functions.

On the Bayes-risk consistency of regularized boosting methods

The main result of the paper is that certain regularized boosting algorithms provide Bayes-risk consistent classifiers under the sole assumption that the Bayes classifier may be approximated by a convex combination of the base classifiers.

SVM Soft Margin Classifiers: Linear Programming versus Quadratic Programming

This article shows that the convergence behavior of the linear programming SVM is almost the same as that of the quadratic programming SVM, and proposes an upper bound on the misclassification error for general probability distributions.

Introduction to Support Vector Machines

Support Vector Machines (SVMs) are intuitive, theoretically well-founded, and have been shown to be practically successful.

Statistical learning theory

Presenting a method for determining the necessary and sufficient conditions for consistency of the learning process, the author covers function estimation from small data pools, the application of these estimates to real-life problems, and much more.