# Cost-Sensitive Learning vs. Sampling: Which is Best for Handling Unbalanced Classes with Unequal Error Costs?

```bibtex
@inproceedings{Weiss2007CostSensitiveLV,
  title     = {Cost-Sensitive Learning vs. Sampling: Which is Best for Handling Unbalanced Classes with Unequal Error Costs?},
  author    = {G. Weiss and Kate McCarthy and Bibi Zabar},
  booktitle = {DMIN},
  year      = {2007}
}
```

The classifier built from a data set with a highly skewed class distribution generally predicts the more frequently occurring classes much more often than the infrequently occurring classes. [...] The first method incorporates the misclassification costs into the learning algorithm, while the other two methods employ oversampling or undersampling to make the training data more balanced. In this paper we empirically compare the effectiveness of these methods in order to determine which produces the…
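As a rough illustration of the three approaches the paper compares, the following numpy sketch shows, on a toy 90/10 imbalanced data set, (1) cost-sensitive instance weighting, (2) oversampling the minority class, and (3) undersampling the majority class. The cost ratio, data, and variable names are illustrative assumptions, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy imbalanced labels: 90 negatives (majority), 10 positives (minority).
y = np.array([0] * 90 + [1] * 10)
X = rng.normal(size=(100, 2)) + y[:, None]  # shift positives slightly

# 1) Cost-sensitive learning: keep the data as-is and weight each
#    training instance by its misclassification cost (here a 9:1 ratio,
#    so each class carries equal total weight).
cost_fn, cost_fp = 9.0, 1.0  # illustrative costs
weights = np.where(y == 1, cost_fn, cost_fp)

# 2) Oversampling: duplicate minority examples until the classes balance.
pos, neg = np.flatnonzero(y == 1), np.flatnonzero(y == 0)
extra = rng.choice(pos, size=len(neg) - len(pos), replace=True)
over_idx = np.concatenate([neg, pos, extra])

# 3) Undersampling: discard majority examples until the classes balance.
keep = rng.choice(neg, size=len(pos), replace=False)
under_idx = np.concatenate([keep, pos])

print(weights.sum())             # 180.0 (90*1 + 10*9)
print(np.bincount(y[over_idx]))  # [90 90]
print(np.bincount(y[under_idx])) # [10 10]
```

All three produce training data (or weights) in which the minority class carries as much influence as the majority class; the paper's question is which transformation yields the better classifier under unequal error costs.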


#### 219 Citations

Cost-Based Sampling of Individual Instances

- Computer Science
- Canadian Conference on AI
- 2009

A general sampling approach that assigns weights to individual instances according to the cost function helps reveal the relationship between classification performance and class ratios, and allows the identification of an appropriate class distribution for which the learning method achieves reasonable performance on the data.

A Comparative Study of Data Sampling and Cost Sensitive Learning

- Computer Science
- 2008 IEEE International Conference on Data Mining Workshops
- 2008

This work investigates the performance of two cost-sensitive learning techniques and four data sampling techniques for minimizing classification costs when data are imbalanced, and presents a comprehensive suite of experiments utilizing 15 datasets with 10 cost ratios.

A Monte Carlo study on methods for handling class imbalance

- 2017

Many applications of classification problems in machine learning involve class imbalance—a situation where the class of interest (the “minority” or “positive” class) makes up a very small percentage…

Automatically countering imbalance and its empirical relationship to cost

- Computer Science
- Data Mining and Knowledge Discovery
- 2008

A wrapper paradigm is proposed that discovers the amount of re-sampling for a data set by optimizing evaluation functions such as the F-measure, area under the ROC curve, cost, cost curves, and cost-dependent F-measures, and is shown to outperform cost-sensitive classifiers in a cost-sensitive environment.

Variance Ranking Attributes Selection Techniques for Binary Classification Problem in Imbalance Data

- Computer Science
- IEEE Access
- 2019

A novel similarity measurement technique, ranked-order similarity (ROS), is used to evaluate variance ranking attribute selection against the Pearson correlation and information gain techniques, and shows better results than these benchmarks.

Cost-Sensitive Universum-SVM

- Computer Science
- 2012 11th International Conference on Machine Learning and Applications
- 2012

This paper extends the U-SVM to problems with different misclassification costs, and presents practical conditions for the effectiveness of the cost-sensitive U-SVM.

Minimax Modifications of Linear Discriminant Analysis for Classification with Rare Classes

- Computer Science
- 2020 IEEE East-West Design & Test Symposium (EWDTS)
- 2020

Cost-efficient modifications of Linear Discriminant Analysis are presented that mitigate the problem of classifying imbalanced samples with rare classes by minimizing the maximal classification error among the classes.

An Optimized Cost-Sensitive SVM for Imbalanced Data Learning

- Computer Science
- PAKDD
- 2013

An effective wrapper framework is presented that incorporates the evaluation measures (AUC and G-mean) directly into the objective function of a cost-sensitive SVM, improving classification performance by simultaneously optimizing the feature subset, intrinsic parameters, and misclassification cost parameters.

Undersampling Near Decision Boundary for Imbalance Problems

- Computer Science
- 2019 International Conference on Machine Learning and Cybernetics (ICMLC)
- 2019

A novel undersampling method, UnderSampling using Sensitivity (USS), is proposed based on the sensitivity of each majority example; experiments confirm the superiority of USS over one baseline method and five resampling methods.

The OCS-SVM: An Objective-Cost-Sensitive SVM With Sample-Based Misclassification Cost Invariance

- Computer Science
- IEEE Access
- 2019

Inspired by the concept of the CS-SVM, a new SVM with sample-based misclassification cost invariance is proposed with the aim of constructing a relatively reliable classifier, defined as one with a low probability that a classifier can be found that correctly classifies each misclassified sample.

#### References

Showing 1–10 of 21 references

C4.5, Class Imbalance, and Cost Sensitivity: Why Under-Sampling beats Over-Sampling

- Computer Science
- 2003

This paper shows that using C4.5 with undersampling establishes a reasonable standard for algorithmic comparison, and recommends that the cheapest-class classifier be part of that standard, as it can be better than undersampling for relatively modest costs.


SMOTE: Synthetic Minority Over-sampling Technique

- Computer Science, Mathematics
- J. Artif. Intell. Res.
- 2002

A combination of oversampling the minority (abnormal) class and undersampling the majority class can achieve better classifier performance (in ROC space); the combinations of these methods are evaluated using the area under the Receiver Operating Characteristic curve (AUC) and the ROC convex hull strategy.
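The core SMOTE idea — synthesize new minority examples by interpolating between a minority sample and one of its k nearest minority neighbours — can be sketched in a few lines of numpy. This is a minimal, illustrative reimplementation (the function name, toy data, and default parameters are assumptions, not the reference implementation):

```python
import numpy as np

def smote_sketch(X_min, n_new, k=3, rng=None):
    """Minimal SMOTE-style oversampling: each synthetic point lies on the
    line segment between a random minority sample and one of its k nearest
    minority neighbours, at a random interpolation fraction in [0, 1]."""
    if rng is None:
        rng = np.random.default_rng(0)
    new = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        # Distances from sample i to all other minority samples.
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        nn = np.argsort(d)[1:k + 1]  # skip the sample itself
        j = rng.choice(nn)
        gap = rng.random()           # interpolation fraction
        new.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(new)

# Four minority points at the corners of the unit square; all synthetic
# points fall on segments between corners, so they stay inside [0, 1]^2.
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_sketch(X_min, n_new=5)
print(synthetic.shape)  # (5, 2)
```

Because the synthetic points are interpolations rather than duplicates, SMOTE expands the minority region instead of merely reweighting it, which is why the paper evaluates it against plain over- and undersampling in ROC space.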

C4.5 and Imbalanced Data Sets: Investigating the effect of sampling method, probabilistic estimate, and decision tree structure

- Computer Science
- 2003

This paper studies the quality of probabilistic estimates, pruning, and preprocessing of the imbalanced data set by over- or undersampling methods such that a fairly balanced training set is provided to the decision trees.

Learning When Training Data are Costly: The Effect of Class Distribution on Tree Induction

- Computer Science
- J. Artif. Intell. Res.
- 2003

A "budget-sensitive" progressive sampling algorithm is introduced for selecting training examples based on the class associated with each example, and it is shown that the class distribution of the resulting training set yields classifiers with good (nearly optimal) classification performance.

The Foundations of Cost-Sensitive Learning

- Computer Science
- IJCAI
- 2001

It is argued that changing the balance of negative and positive training examples has little effect on the classifiers produced by standard Bayesian and decision tree learning methods, and that the recommended way of applying one of these methods is to learn a classifier from the training set and then to compute optimal decisions explicitly using the probability estimates given by the classifier.
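The recommendation above — learn a probability estimator, then compute cost-optimal decisions from its outputs — reduces, for a two-class cost matrix with zero cost on correct predictions, to thresholding the estimated positive probability at p* = c_FP / (c_FP + c_FN). A minimal sketch with illustrative cost values:

```python
def optimal_threshold(cost_fp, cost_fn):
    """Elkan (2001): for a 2x2 cost matrix with zero cost on correct
    predictions, predicting positive is optimal when P(y=1|x) exceeds
    p* = c_FP / (c_FP + c_FN)."""
    return cost_fp / (cost_fp + cost_fn)

def decide(p_positive, cost_fp, cost_fn):
    # Minimise expected cost: predicting positive costs (1 - p) * c_FP
    # in expectation, predicting negative costs p * c_FN.
    return int(p_positive > optimal_threshold(cost_fp, cost_fn))

# With false negatives 9x as costly as false positives, the threshold
# drops from 0.5 to 0.1, so even weak positive evidence flips the call.
print(optimal_threshold(1.0, 9.0))  # 0.1
print(decide(0.2, 1.0, 9.0))        # 1
print(decide(0.2, 1.0, 1.0))        # 0
```

This is why Elkan argues rebalancing the training data is unnecessary for probabilistic classifiers: the same cost sensitivity can be obtained at decision time by moving the threshold.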

Learning When Data Sets are Imbalanced and When Costs are Unequal and Unknown

- Mathematics
- 2003

The problem of learning from imbalanced data sets, while not the same problem as learning when misclassification costs are unequal and unknown, can be handled in a similar manner. That is, in both…

An iterative method for multi-class cost-sensitive learning

- Computer Science, Mathematics
- KDD
- 2004

This paper empirically evaluates the performance of the proposed method on benchmark data sets and shows that it generally achieves better results than representative cost-sensitive learning methods, in terms of predictive performance (cost minimization) and, in many cases, computational efficiency.

Improving classifier utility by altering the misclassification cost ratio

- Computer Science
- UBDM '05
- 2005

By using a hold-out set to identify the "best" cost ratio for learning, this paper is able to exploit this behavior and generate classifiers that outperform the accepted strategy of always using the actual cost information during the learning phase.

The class imbalance problem: A systematic study

- Mathematics, Computer Science
- Intell. Data Anal.
- 2002

It is investigated whether the class imbalance problem affects not only decision tree systems but also other classification systems such as Neural Networks and Support Vector Machines.