SMOTE: Synthetic Minority Over-sampling Technique
@article{Chawla2002SMOTESM,
  title   = {SMOTE: Synthetic Minority Over-sampling Technique},
  author  = {N. Chawla and K. Bowyer and Lawrence O. Hall and W. Philip Kegelmeyer},
  journal = {ArXiv},
  year    = {2002},
  volume  = {abs/1106.1813}
}
An approach to the construction of classifiers from imbalanced datasets is described. A dataset is imbalanced if the classification categories are not approximately equally represented. Often real-world data sets are predominately composed of "normal" examples with only a small percentage of "abnormal" or "interesting" examples. It is also the case that the cost of misclassifying an abnormal (interesting) example as a normal example is often much higher than the cost of the reverse error. Under…
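The over-sampling step the abstract describes — creating synthetic minority examples by interpolating between a minority sample and one of its nearest minority neighbours — can be sketched as below. This is a minimal illustration, not the authors' code; `smote` is a hypothetical helper and assumes `X` contains only minority-class samples.

```python
import numpy as np

def smote(X, n_new, k=5, rng=None):
    """Generate n_new synthetic minority samples by interpolating
    between a randomly chosen sample and one of its k nearest
    minority-class neighbours (illustrative sketch)."""
    rng = np.random.default_rng(rng)
    X = np.asarray(X, dtype=float)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X))
        # Euclidean distance from X[i] to every minority sample
        d = np.linalg.norm(X - X[i], axis=1)
        neighbours = np.argsort(d)[1:k + 1]  # skip the sample itself
        j = rng.choice(neighbours)
        gap = rng.random()  # uniform in [0, 1): position along the line segment
        synthetic.append(X[i] + gap * (X[j] - X[i]))
    return np.vstack(synthetic)
```

Because each synthetic point lies on the line segment between two real minority samples, the new examples stay inside the region spanned by the minority class rather than duplicating existing points, which is the key difference from over-sampling with replacement.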
16,856 Citations
Combination approach of SMOTE and biased-SVM for imbalanced datasets
- Computer Science2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence)
- 2008
A new approach to constructing classifiers from imbalanced datasets is proposed by combining SMOTE (synthetic minority over-sampling technique) and biased-SVM (biased support vector machine); experimental results confirm that the combination achieves better classifier performance.
IMBALANCED DATASETS : FROM SAMPLING TO CLASSIFIERS
- Computer Science
- 2013
This chapter provides an overview of the sampling strategies and classification algorithms developed for countering class imbalance, considers the issue of correctly evaluating classifier performance on imbalanced datasets, and presents a discussion of various metrics.
A boosting based approach to handle imbalanced data
- Computer Science2022 30th International Conference on Electrical Engineering (ICEE)
- 2022
A novel boosting-based algorithm for learning from imbalanced datasets is proposed, combining the proposed Peak under-sampling algorithm with the SMOTE over-sampling technique inside the boosting procedure.
Building Accurate Classifiers from Imbalanced Data Sets
- Computer Science
Experimental results show that an approach combining undersampling with the generation of synthetic minority examples effectively improves classification accuracy on small classes in data with large differences in class size.
A new sampling approach for classification of imbalanced data sets with high density
- Computer Science2014 International Conference on Big Data and Smart Computing (BIGCOMP)
- 2014
Two new sampling methods based on Borderline-SMOTE are proposed; they achieve better performance than random over-sampling, SMOTE (synthetic minority over-sampling technique), and Borderline-SMOTE on the AUC (area under the receiver operating characteristic curve) metric when the sampling rate brings the majority and minority classes into approximate equilibrium.
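The AUC metric used for evaluation above has a simple rank-based reading: it is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as half. A small sketch of that formulation (`auc` is an illustrative helper, not from any of the cited papers):

```python
import numpy as np

def auc(scores_pos, scores_neg):
    """AUC as P(score of random positive > score of random negative),
    counting ties as 0.5 -- the rank-sum formulation."""
    s_p = np.asarray(scores_pos, dtype=float)
    s_n = np.asarray(scores_neg, dtype=float)
    # compare every positive score against every negative score
    wins = (s_p[:, None] > s_n[None, :]).sum()
    ties = (s_p[:, None] == s_n[None, :]).sum()
    return (wins + 0.5 * ties) / (len(s_p) * len(s_n))
```

Unlike plain accuracy, this quantity is insensitive to the class ratio, which is why it is the standard metric in the imbalanced-learning papers listed here.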
Classification of Imbalanced Data of Medical Diagnosis using Sampling Techniques
- Computer Science
- 2021
The main objective is to handle the imbalance classification problem occurring in the medical diagnosis of rare diseases and combines the benefits of both undersampling and oversampling.
A theoretical distribution analysis of synthetic minority oversampling technique (SMOTE) for imbalanced learning
- Computer ScienceMachine Learning
- 2023
A novel theoretical analysis of the SMOTE method is developed by deriving the probability distribution of the SMOTE generated samples by means of a mathematical formulation, which allows us to compare the density of the generated samples with the true underlying class-conditional density, in order to assess how representative the generated samples are.
CUSBoost: Cluster-Based Under-Sampling with Boosting for Imbalanced Classification
- Computer Science2017 2nd International Conference on Computational Systems and Information Technology for Sustainable Solution (CSITSS)
- 2017
A new clustering-based under-sampling approach with the boosting (AdaBoost) algorithm, called CUSBoost, is proposed for effective imbalanced classification and is evaluated against state-of-the-art ensemble-learning methods on 13 imbalanced binary and multi-class datasets with various imbalance ratios.
Cluster-based majority under-sampling approaches for class imbalance learning
- Computer Science2010 2nd IEEE International Conference on Information and Financial Engineering
- 2010
Cluster-based majority under-sampling approaches are proposed that select a representative subset from the majority class and use it, together with all minority-class samples, as training data to improve accuracy over both the minority and majority classes.
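The representative-subset idea can be sketched as follows: cluster the majority class and keep only the sample nearest each centroid. This is an illustrative reading of the cluster-based approach, not the paper's exact procedure; `cluster_undersample` is a hypothetical helper using a small hand-rolled k-means.

```python
import numpy as np

def cluster_undersample(X_maj, k, iters=20, rng=None):
    """Reduce the majority class to (at most) k representatives:
    run k-means, then keep the actual sample nearest each centroid."""
    rng = np.random.default_rng(rng)
    X = np.asarray(X_maj, dtype=float)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # assign each majority sample to its nearest centroid
        d = np.linalg.norm(X[:, None] - centroids[None], axis=2)
        labels = d.argmin(axis=1)
        for c in range(k):
            members = X[labels == c]
            if len(members):
                centroids[c] = members.mean(axis=0)
    # representative subset: the nearest real sample to each centroid
    d = np.linalg.norm(X[:, None] - centroids[None], axis=2)
    return X[np.unique(d.argmin(axis=0))]
```

The representatives are then combined with all minority samples to form the training set, so the majority class is summarized rather than discarded at random.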
An Active Under-Sampling Approach for Imbalanced Data Classification
- Computer Science2012 Fifth International Symposium on Computational Intelligence and Design
- 2012
An active under-sampling approach is proposed for handling the imbalance problem; experimental results show that it can effectively improve the classification accuracy of minority classes while maintaining overall classification performance.
References
Showing 1-10 of 39 references
MetaCost: a general method for making classifiers cost-sensitive
- Computer ScienceKDD '99
- 1999
A principled method for making an arbitrary classifier cost-sensitive by wrapping a cost-minimizing procedure around it is proposed, called MetaCost, which treats the underlying classifier as a black box, requiring no knowledge of its functioning or change to it.
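The decision rule at the heart of the wrapper described above is Bayes-risk minimization: given the black-box classifier's class probabilities and a cost matrix, predict the class with the lowest expected cost. A minimal sketch of that rule (the helper name and cost-matrix convention are assumptions, not MetaCost's actual interface):

```python
import numpy as np

def min_cost_class(probs, cost):
    """Pick the prediction with the lowest expected cost, given
    class-membership probabilities `probs` and a cost matrix where
    cost[actual, predicted] is the cost of that outcome."""
    probs = np.asarray(probs, dtype=float)
    cost = np.asarray(cost, dtype=float)
    # expected cost of predicting j: sum_i P(actual = i) * cost[i, j]
    return int(np.argmin(probs @ cost))
```

With a symmetric cost matrix this reduces to picking the most probable class; with an asymmetric one (e.g. misclassifying an "abnormal" example costs 10x the reverse error) the rule can favor the rare class even when its probability is lower, which is exactly the behavior wanted for imbalanced data.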
The Class Imbalance Problem: Significance and Strategies
- Computer Science
- 2000
This paper demonstrates experimentally that, at least in the case of connectionist systems, class imbalances hinder the performance of standard classifiers and compares several approaches previously proposed to deal with the problem.
Feature Selection for Unbalanced Class Distribution and Naive Bayes
- Computer ScienceICML
- 1999
This paper describes an approach to feature subset selection that takes into account problem specifics and learning algorithm characteristics, and shows that considering domain and algorithm characteristics significantly improves the results of classification.
Robust Classification for Imprecise Environments
- Computer ScienceMachine Learning
- 2004
It is shown that it is possible to build a hybrid classifier that will perform at least as well as the best available classifier for any target conditions, and in some cases, the performance of the hybrid actually can surpass that of the best known classifier.
C4.5: Programs for Machine Learning
- Computer Science
- 1992
A complete guide to the C4.5 system as implemented in C for the UNIX environment, which starts from simple core learning methods and shows how they can be elaborated and extended to deal with typical problems such as missing data and overfitting.
Addressing the Curse of Imbalanced Training Sets: One-Sided Selection
- Computer ScienceICML
- 1997
Criteria to evaluate the utility of classifiers induced from such imbalanced training sets are discussed, an explanation of the poor behavior of some learners under these circumstances is given, and a simple technique called one-sided selection of examples is suggested.
The use of the area under the ROC curve in the evaluation of machine learning algorithms
- Computer SciencePattern Recognit.
- 1997
Context-sensitive learning methods for text categorization
- Computer ScienceSIGIR '96
- 1996
RIPPER and sleeping-experts perform extremely well across a wide variety of categorization problems, generally outperforming previously applied learning methods; the results are viewed as confirming the usefulness of classifiers that represent contextual information.