SMOTE: Synthetic Minority Over-sampling Technique

  • N. Chawla, K. Bowyer, Lawrence O. Hall, W. Philip Kegelmeyer
An approach to the construction of classifiers from imbalanced datasets is described. A dataset is imbalanced if the classification categories are not approximately equally represented. Often real-world data sets are predominately composed of "normal" examples with only a small percentage of "abnormal" or "interesting" examples. It is also the case that the cost of misclassifying an abnormal (interesting) example as a normal example is often much higher than the cost of the reverse error. Under… 
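The core SMOTE mechanism, not spelled out in the truncated abstract above, is linear interpolation between a minority-class example and one of its k nearest minority-class neighbours. A minimal sketch using numpy; the function name `smote_sample` and its parameters are illustrative, not the paper's implementation:

```python
import numpy as np

def smote_sample(X_min, k=5, n_new=10, rng=None):
    """Generate synthetic minority samples by interpolating each seed
    point toward one of its k nearest minority-class neighbours."""
    rng = np.random.default_rng(rng)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        x = X_min[i]
        # distances from the seed to every other minority sample
        d = np.linalg.norm(X_min - x, axis=1)
        neighbours = np.argsort(d)[1:k + 1]  # skip the seed itself
        nb = X_min[rng.choice(neighbours)]
        gap = rng.random()  # random position along the segment
        synthetic.append(x + gap * (nb - x))
    return np.array(synthetic)

X_min = np.array([[1.0, 1.0], [2.0, 1.0], [1.5, 2.0], [2.5, 2.5]])
X_new = smote_sample(X_min, k=2, n_new=5, rng=0)
```

Because every synthetic point lies on a segment between two real minority samples, all generated points fall inside the convex hull of the minority class.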

Combination approach of SMOTE and biased-SVM for imbalanced datasets

  • He-Yong Wang
  • Computer Science
    2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence)
  • 2008
A new approach to constructing classifiers from imbalanced datasets is proposed by combining SMOTE (Synthetic Minority Over-sampling Technique) and Biased-SVM (biased support vector machine); experimental results confirm that the combination can achieve better classifier performance.


This chapter provides an overview of the sampling strategies as well as classification algorithms developed for countering class imbalance, and considers the issues of correctly evaluating the performance of a classifier on imbalanced datasets and presents a discussion on various metrics.

A boosting based approach to handle imbalanced data

A novel boosting-based algorithm for learning from imbalanced datasets is proposed, combining the proposed Peak under-sampling algorithm with the SMOTE over-sampling technique inside the boosting procedure.

Building Accurate Classifiers from Imbalanced Data Sets

Experimental results show that an approach combining under-sampling with the generation of synthetic minority examples is effective at improving classification accuracy on small classes in data with large differences in class size.

A new sampling approach for classification of imbalanced data sets with high density

Two new sampling methods based on Borderline-SMOTE are proposed, which can achieve better performance than random over-sampling, SMOTE (Synthetic Minority Over-sampling Technique), and Borderline-SMOTE under the AUC (Area Under the Receiver Operating Characteristic Curve) metric when the sampling rate brings the majority and minority class samples into approximate equilibrium.
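The AUC metric referenced above equals the probability that a randomly chosen positive example is scored higher than a randomly chosen negative one (the Mann-Whitney rank statistic). A minimal, dependency-free sketch of that equivalence:

```python
def auc_score(y_true, scores):
    """AUC via the rank-sum (Mann-Whitney) statistic: the probability
    that a random positive is scored above a random negative,
    counting ties as half a win."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

auc = auc_score([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])  # → 0.75
```

Unlike plain accuracy, this statistic is insensitive to the class ratio, which is why it is the usual evaluation choice for imbalanced problems.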

Classification of Imbalanced Data of Medical Diagnosis using Sampling Techniques

The main objective is to handle the imbalance classification problem occurring in the medical diagnosis of rare diseases and combines the benefits of both undersampling and oversampling.

A theoretical distribution analysis of synthetic minority oversampling technique (SMOTE) for imbalanced learning

A novel theoretical analysis of the SMOTE method is developed by deriving the probability distribution of the SMOTE-generated samples by means of a mathematical formulation, which allows us to compare the density of the generated samples with the true underlying class-conditional density, in order to assess how representative the generated samples are.

CUSBoost: Cluster-Based Under-Sampling with Boosting for Imbalanced Classification

A new clustering-based under-sampling approach with the boosting (AdaBoost) algorithm, called CUSBoost, is proposed for effective imbalanced classification and is evaluated against state-of-the-art ensemble-learning methods on 13 imbalanced binary and multi-class datasets with various imbalance ratios.

Cluster-based majority under-sampling approaches for class imbalance learning

Cluster-based majority under-sampling approaches are proposed that select a representative subset from the majority class and use this subset, together with all minority-class samples, as training data to improve accuracy over both the minority and majority classes.
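As a hedged sketch of the general idea (not the paper's exact algorithm): cluster the majority class with k-means, keep the real sample nearest each centroid as the representative subset, and train on those representatives plus all minority samples. The function name and the k = |minority| balancing heuristic are my own assumptions:

```python
import numpy as np

def cluster_undersample(X_maj, X_min, k=None, iters=20, seed=0):
    """Under-sample the majority class by k-means clustering it and
    keeping the sample nearest each centroid, then combine those
    representatives with ALL minority samples as the training set."""
    rng = np.random.default_rng(seed)
    k = k or len(X_min)  # heuristic assumption: balance the classes
    # -- tiny k-means on the majority class --
    centroids = X_maj[rng.choice(len(X_maj), size=k, replace=False)]
    for _ in range(iters):
        dists = ((X_maj[:, None] - centroids) ** 2).sum(-1)  # (n_maj, k)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = X_maj[labels == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    # keep the real sample closest to each centroid as its representative
    dists = ((X_maj[:, None] - centroids) ** 2).sum(-1)
    reps = X_maj[dists.argmin(axis=0)]
    X = np.vstack([reps, X_min])
    y = np.array([0] * len(reps) + [1] * len(X_min))  # 0 = majority, 1 = minority
    return X, y

X_maj = np.array([[0., 0.], [0.2, 0.], [5., 5.], [5.2, 5.], [10., 0.], [10., 0.3]])
X_min = np.array([[3., 3.], [3.1, 3.2], [2.9, 3.0]])
X, y = cluster_undersample(X_maj, X_min, k=3)
```

Picking the nearest real sample rather than the centroid itself keeps the reduced training set drawn entirely from observed data.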

An Active Under-Sampling Approach for Imbalanced Data Classification

  • Zeping Yang, Daqi Gao
  • Computer Science
    2012 Fifth International Symposium on Computational Intelligence and Design
  • 2012
An active under-sampling approach is proposed for handling the imbalance problem; experimental results show that it can effectively improve the classification accuracy of minority classes while maintaining overall classification performance.



MetaCost: a general method for making classifiers cost-sensitive

A principled method for making an arbitrary classifier cost-sensitive by wrapping a cost-minimizing procedure around it is proposed, called MetaCost, which treats the underlying classifier as a black box, requiring no knowledge of its functioning or change to it.
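MetaCost's wrapper works in three steps: estimate class probabilities by bagging the black-box classifier, relabel each training example with its class of minimum expected cost, and retrain the classifier on the relabeled data. A sketch of just the relabeling step (probability estimation and retraining omitted):

```python
import numpy as np

def metacost_relabel(proba, cost):
    """MetaCost's core relabeling step: assign each training example the
    class minimizing expected cost, given estimated class probabilities
    proba[n, j] = P(j | x_n) and a cost matrix cost[i, j] = cost of
    predicting class i when the true class is j."""
    # expected[n, i] = sum_j P(j | x_n) * cost[i, j]
    expected = proba @ cost.T
    return expected.argmin(axis=1)

proba = np.array([[0.9, 0.1],      # estimated P(class | x) per example
                  [0.3, 0.7]])
cost = np.array([[0.0, 10.0],      # predicting 0: costly when truth is 1
                 [1.0, 0.0]])      # predicting 1: cheap when truth is 0
labels = metacost_relabel(proba, cost)  # → array([1, 1])
```

With a high false-negative cost, even the example with P(1|x) = 0.1 is relabeled to the rare class — this is how MetaCost shifts an ordinary classifier toward cost-sensitive behavior without modifying it.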

The Class Imbalance Problem: Significance and Strategies

This paper demonstrates experimentally that, at least in the case of connectionist systems, class imbalances hinder the performance of standard classifiers and compares several approaches previously proposed to deal with the problem.

Feature Selection for Unbalanced Class Distribution and Naive Bayes

This paper describes an approach to feature subset selection that takes into account problem specifics and learning algorithm characteristics, and shows that considering domain and algorithm characteristics significantly improves the results of classification.

Noisy replication in skewed binary classification

Robust Classification for Imprecise Environments

It is shown that it is possible to build a hybrid classifier that will perform at least as well as the best available classifier for any target conditions, and in some cases, the performance of the hybrid actually can surpass that of the best known classifier.

C4.5: Programs for Machine Learning

A complete guide to the C4.5 system as implemented in C for the UNIX environment, which starts from simple core learning methods and shows how they can be elaborated and extended to deal with typical problems such as missing data and overfitting.

Addressing the Curse of Imbalanced Training Sets: One-Sided Selection

Criteria to evaluate the utility of classifiers induced from such imbalanced training sets are discussed, an explanation of the poor behavior of some learners under these circumstances is given, and a simple technique called one-sided selection of examples is suggested.

Context-sensitive learning methods for text categorization

RIPPER and sleeping-experts perform extremely well across a wide variety of categorization problems, generally outperforming previously applied learning methods and are viewed as a confirmation of the usefulness of classifiers that represent contextual information.