Corpus ID: 6356183

C4.5, Class Imbalance, and Cost Sensitivity: Why Under-Sampling beats Over-Sampling

@inproceedings{Drummond2003C4,
  title={C4.5, Class Imbalance, and Cost Sensitivity: Why Under-Sampling beats Over-Sampling},
  author={Chris Drummond and Robert C. Holte},
  year={2003}
}
This paper takes a new look at two sampling schemes commonly used to adapt machine learning algorithms to imbalanced classes and misclassification costs. It uses a performance analysis technique called cost curves to explore the interaction of over- and under-sampling with the decision tree learner C4.5. C4.5 was chosen because, when combined with one of the sampling schemes, it is quickly becoming the community standard when evaluating new cost-sensitive learning algorithms. This paper shows that using C4.5…
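As a quick illustration of the two sampling schemes being compared, here is a minimal sketch of random under-sampling of the majority class versus random over-sampling (duplication) of the minority class, assuming NumPy; the function names are illustrative, not code from the paper.

import numpy as np

def undersample_majority(X, y, majority_label, rng=None):
    # Randomly drop majority-class rows until the classes are balanced.
    rng = rng or np.random.default_rng(0)
    maj = np.flatnonzero(y == majority_label)
    mino = np.flatnonzero(y != majority_label)
    keep = rng.choice(maj, size=len(mino), replace=False)
    idx = np.concatenate([keep, mino])
    return X[idx], y[idx]

def oversample_minority(X, y, majority_label, rng=None):
    # Randomly duplicate minority-class rows until the classes are balanced.
    rng = rng or np.random.default_rng(0)
    maj = np.flatnonzero(y == majority_label)
    mino = np.flatnonzero(y != majority_label)
    extra = rng.choice(mino, size=len(maj) - len(mino), replace=True)
    idx = np.concatenate([maj, mino, extra])
    return X[idx], y[idx]

Under-sampling discards majority examples while over-sampling duplicates minority examples; the paper's cost-curve analysis of C4.5 favours the former, as the title states.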
Cost-Sensitive Learning vs. Sampling: Which is Best for Handling Unbalanced Classes with Unequal Error Costs?
Three methods for dealing with data that has a skewed class distribution and nonuniform misclassification costs are compared in order to determine which produces the best overall classifier—and under what circumstances.
Does cost-sensitive learning beat sampling for classifying rare classes?
Two basic strategies for dealing with data that has a skewed class distribution and non-uniform misclassification costs are compared, one based on cost-sensitive learning while the other strategy employs sampling to create a more balanced class distribution in the training set.
A Comparison Study of Cost-Sensitive Learning and Sampling Methods on Imbalanced Data Sets
A comparison of three classification methods on data sets with imbalanced class distributions and non-uniform misclassification costs finds that cost-sensitive learning is well suited to classifying imbalanced data sets and is more stable than sampling methods, except when the data set is quite small.
Thresholding for Making Classifiers Cost-sensitive
A very simple yet general and effective method, called Thresholding, makes any cost-insensitive classifier that can produce probability estimates cost-sensitive by selecting a proper threshold from training instances according to the misclassification cost.
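Based only on the description above, a sketch of the thresholding idea might look as follows, showing both the textbook decision-theoretic threshold C_fp / (C_fp + C_fn) for calibrated probabilities and an empirical selection over training instances; the function names and the exact selection rule are assumptions, not necessarily the paper's procedure.

import numpy as np

def theoretical_threshold(cost_fp, cost_fn):
    # Bayes-optimal threshold when the probability estimates are calibrated.
    return cost_fp / (cost_fp + cost_fn)

def empirical_threshold(p_train, y_train, cost_fp, cost_fn):
    # Pick the candidate threshold that minimizes total cost on training data.
    best_t, best_cost = 0.5, float("inf")
    for t in np.unique(p_train):
        pred = (p_train >= t).astype(int)
        cost = (cost_fp * np.sum((pred == 1) & (y_train == 0))
                + cost_fn * np.sum((pred == 0) & (y_train == 1)))
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t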
Automatically countering imbalance and its empirical relationship to cost
A wrapper paradigm is proposed that discovers the amount of re-sampling for a data set by optimizing evaluation functions such as the F-measure, area under the ROC curve, cost, cost curves, and cost-dependent F-measures, and is shown to outperform cost-sensitive classifiers in a cost-sensitive environment.
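A rough sketch of that wrapper idea, assuming scikit-learn and a hypothetical helper that searches candidate over-sampling rates for the best cross-validated F-measure (the paper also considers other evaluation functions):

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def best_oversampling_rate(X, y, rates=(1.0, 1.5, 2.0, 3.0), rng=None):
    # Try each candidate re-sampling amount and keep the best-scoring one.
    rng = rng or np.random.default_rng(0)
    mino = np.flatnonzero(y == 1)          # assume label 1 is the minority
    best_rate, best_f1 = 1.0, -1.0
    for r in rates:
        extra = rng.choice(mino, size=int((r - 1.0) * len(mino)), replace=True)
        Xr = np.vstack([X, X[extra]])
        yr = np.concatenate([y, y[extra]])
        f1 = cross_val_score(DecisionTreeClassifier(), Xr, yr,
                             scoring="f1", cv=5).mean()
        if f1 > best_f1:
            best_rate, best_f1 = r, f1
    return best_rate

Note that duplicating examples before cross-validation can leak copies across folds; a careful implementation would re-sample inside each training fold.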
Class Imbalance and Cost-Sensitive Decision Trees
  • Michael J. Siers, Md Zahidul Islam
  • Computer Science
  • ACM Trans. Knowl. Discov. Data
  • 2021
Class imbalance treatment methods and cost-sensitive classification algorithms are typically treated as two independent research areas. However, many of these techniques have properties in common.
Cost-Sensitive Bayesian Network Learning Using Sampling
This paper develops a new Bayesian network learning algorithm based on changing the data distribution to reflect the costs of misclassification and shows that this approach produces good results in comparison to more complex cost-sensitive decision tree algorithms.
Cluster-based majority under-sampling approaches for class imbalance learning
Cluster-based majority under-sampling approaches are proposed that select a representative subset from the majority class and use it, together with all minority-class samples, as training data to improve accuracy over both the minority and majority classes.
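A minimal sketch of the cluster-based idea, assuming scikit-learn's KMeans and keeping the majority sample nearest each centroid as the representative (the clustering algorithm and representative choice are assumptions, not necessarily those of the paper):

import numpy as np
from sklearn.cluster import KMeans

def cluster_undersample(X, y, majority_label=0, n_clusters=None):
    maj = np.flatnonzero(y == majority_label)
    mino = np.flatnonzero(y != majority_label)
    k = n_clusters or len(mino)            # balance the classes by default
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X[maj])
    # Keep the majority sample closest to each cluster centroid.
    reps = [maj[np.argmin(np.linalg.norm(X[maj] - c, axis=1))]
            for c in km.cluster_centers_]
    idx = np.concatenate([np.array(reps), mino])
    return X[idx], y[idx]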
A Cost-Sensitive Deep Belief Network for Imbalanced Classification
An evolutionary cost-sensitive deep belief network (ECS-DBN) for imbalanced classification is proposed; it uses adaptive differential evolution to optimize the misclassification costs based on the training data, providing an effective approach to incorporating the evaluation measure into the objective function.
Combining integrated sampling with SVM ensembles for learning from imbalanced datasets
It is shown that SVMs may suffer from biased decision boundaries and that their prediction performance drops dramatically when the data is highly skewed; an integrated sampling technique is proposed that outperforms individual SVMs as well as several other state-of-the-art classifiers.

References

Showing 1-10 of 13 references
MetaCost: a general method for making classifiers cost-sensitive
A principled method, called MetaCost, is proposed for making an arbitrary classifier cost-sensitive by wrapping a cost-minimizing procedure around it; it treats the underlying classifier as a black box, requiring no knowledge of its functioning or change to it.
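A condensed sketch of the MetaCost recipe as usually summarized: estimate class probabilities with a bagged ensemble, relabel each training example with the class of minimum expected cost, then retrain the black-box learner on the relabelled data. The scikit-learn usage is an assumption, and the full procedure has refinements (e.g. out-of-bag voting) omitted here.

import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

def metacost(X, y, cost_matrix, n_bags=10):
    # cost_matrix[i, j] = cost of predicting class j when the truth is class i.
    bag = BaggingClassifier(DecisionTreeClassifier(),
                            n_estimators=n_bags, random_state=0).fit(X, y)
    p = bag.predict_proba(X)                # P(class i | x) for each example
    expected_cost = p @ cost_matrix         # expected cost of predicting j
    y_relabelled = np.argmin(expected_cost, axis=1)
    # Retrain the (arbitrary) base learner on the cost-optimal relabelling.
    return DecisionTreeClassifier().fit(X, y_relabelled)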
Exploiting the Cost (In)sensitivity of Decision Tree Splitting Criteria
This paper investigates how the splitting criteria and pruning methods of decision tree learning algorithms are influenced by misclassification costs or changes to the class distribution. Splitting…
SMOTE: Synthetic Minority Over-sampling Technique
It is shown that a combination of over-sampling the minority (abnormal) class and under-sampling the majority (normal) class can achieve better classifier performance (in ROC space); the approach is evaluated using the area under the Receiver Operating Characteristic curve (AUC) and the ROC convex hull strategy.
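The core of SMOTE is generating synthetic minority examples by interpolating between a minority sample and one of its k nearest minority neighbours; here is a minimal sketch, assuming scikit-learn for the neighbour search (parameter names are illustrative, not the reference implementation):

import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote(X_min, n_synthetic, k=5, rng=None):
    rng = rng or np.random.default_rng(0)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)  # +1: self is returned
    _, nbrs = nn.kneighbors(X_min)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.integers(len(X_min))                 # random minority sample
        j = nbrs[i][rng.integers(1, k + 1)]          # skip column 0 (itself)
        gap = rng.random()                           # interpolation in [0, 1)
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synthetic)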
Reducing Misclassification Costs
Algorithms for learning classification procedures that attempt to minimize the cost of misclassifying examples are explored, and the Reduced Cost Ordering algorithm, a new method for creating a decision list, is described and compared to a variety of inductive learning approaches.
The class imbalance problem: A systematic study
It is investigated whether the class imbalance problem affects not only decision tree systems but also other classification systems, such as Neural Networks and Support Vector Machines.
C4.5: Programs for Machine Learning
A complete guide to the C4.5 system as implemented in C for the UNIX environment, which starts from simple core learning methods and shows how they can be elaborated and extended to deal with typical problems such as missing data and overfitting.
Improved Use of Continuous Attributes in C4.5
A reported weakness of C4.5 in domains with continuous attributes is addressed by modifying the formation and evaluation of tests on continuous attributes with an MDL-inspired penalty, leading to smaller decision trees with higher predictive accuracies.
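As best I can reconstruct it, the correction is often stated as subtracting log2(N - 1) / |D| from the information gain of a threshold test, where N is the number of distinct values of the continuous attribute among the |D| cases at the node; treat the exact formula as an assumption to verify against the paper.

import math

def penalised_gain(gain, n_distinct_values, n_cases):
    # MDL-inspired penalty on threshold tests over continuous attributes.
    return gain - math.log2(n_distinct_values - 1) / n_cases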
Explicitly representing expected cost: an alternative to ROC representation
There is a point/line duality between the two representations, allowing most techniques used in ROC analysis to be readily reproduced in the cost space.
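The duality is easy to state concretely: an ROC point (FPR, TPR) becomes a line in cost space giving the normalised expected cost as a function of the probability-cost term PC(+). A minimal sketch, with variable names following the usual cost-curve notation:

def cost_line(fpr, tpr):
    # Return f(pc): normalised expected cost at probability-cost value pc.
    fnr = 1.0 - tpr
    return lambda pc: fnr * pc + fpr * (1.0 - pc)

# Example: a classifier with FPR = 0.1 and TPR = 0.8 evaluated at PC(+) = 0.3
# has normalised expected cost 0.2 * 0.3 + 0.1 * 0.7 = 0.13.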
Data Mining for Direct Marketing: Problems and Solutions
This paper discusses methods of coping with problems that arise in data mining, based on experience from direct-marketing projects, and suggests a simple yet effective way of evaluating learning methods.