Bayes Imbalance Impact Index: A Measure of Class Imbalanced Data Set for Classification Problem

@article{Lu2020BayesII,
  title={Bayes Imbalance Impact Index: A Measure of Class Imbalanced Data Set for Classification Problem},
  author={Yang Lu and Yiu-ming Cheung and Yuanyan Tang},
  journal={IEEE Transactions on Neural Networks and Learning Systems},
  year={2020},
  volume={31},
  pages={3525-3539}
}
Recent studies of imbalanced data classification have shown that the imbalance ratio (IR) is not the only cause of performance loss in a classifier, as other data factors, such as small disjuncts, noise, and overlapping, can also make the problem difficult. The relationship between the IR and other data factors has been demonstrated, but to the best of our knowledge, there is no measurement of the extent to which class imbalance influences the classification performance of imbalanced data. In…
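
The abstract's central point, that the imbalance ratio alone does not determine difficulty, can be illustrated with a small experiment. The sketch below is not the paper's proposed index; it simply builds two synthetic data sets with the same roughly 9:1 imbalance but different class overlap (all parameter values are illustrative) and compares the cross-validated minority-class F1 of a linear classifier.

```python
# A sketch (not the paper's proposed index) showing that two data sets with the
# same imbalance ratio can differ sharply in difficulty once class overlap changes.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def minority_f1(class_sep, seed=0):
    """Cross-validated minority-class F1 on a 2-D data set with ~9:1 imbalance
    and the given separation between the two classes."""
    X, y = make_classification(
        n_samples=2000, n_features=2, n_informative=2, n_redundant=0,
        weights=[0.9], class_sep=class_sep, flip_y=0.0, random_state=seed)
    return cross_val_score(LogisticRegression(max_iter=1000), X, y,
                           cv=5, scoring="f1").mean()

print("same IR, well-separated classes -> minority F1 =", round(minority_f1(2.0), 3))
print("same IR, overlapping classes    -> minority F1 =", round(minority_f1(0.3), 3))
```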

Citations

Assessing the data complexity of imbalanced datasets
TLDR
The experimental results show that the proposed measures assess the difficulty of imbalanced problems better than the original ones, and the difference in data complexity correlates to the predictive performance improvement obtained by applying DITs to the original datasets.
A Graph-Based Measurement for Text Imbalance Classification
TLDR
This paper transforms the unknown distribution of data into a graph model and presents a graph-based imbalance index named GIR to predict the impact of imbalanced text data on classification performance, and introduces an environmental factor that makes the imbalance index sensitive to the intrinsic characteristics of data.
Sentimental analysis from imbalanced code-mixed data using machine learning approaches
TLDR
This paper addresses the class imbalance problem, one of the important issues in sentiment analysis, and proposes a solution for analyzing sentiments in class-imbalanced code-mixed data using a sampling technique combined with Levenshtein distance metrics.
wCM based hybrid pre-processing algorithm for class imbalanced dataset
  • Deepika Singh, Anju Saha, A. Gosain · J. Intell. Fuzzy Syst., 2021
TLDR
A novel hybrid pre-processing algorithm is presented that treats class-label noise in imbalanced datasets, which also suffer from other intrinsic factors such as class overlapping, non-linear class boundaries, small disjuncts, and borderline examples.
Comparative Analysis on Imbalanced Multi-class Classification for Malware Samples using CNN
TLDR
This paper uses a Convolutional Neural Network (CNN) as the classification algorithm to study the effect of imbalanced datasets on deep learning approaches and demonstrates that methods such as cost-sensitive learning, oversampling, and cross-validation have positive effects on model classification performance, albeit in varying degrees.
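
As a rough illustration of the cost-sensitive option mentioned in this summary, the sketch below uses inverse-frequency class weights in a linear scikit-learn model as a stand-in for the paper's CNN setting; the data set and all parameter values are illustrative.

```python
# Minimal cost-sensitive learning sketch: inverse-frequency class weights make
# minority-class errors more expensive. A linear model stands in for the CNN.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, weights=[0.95], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=1)

for cw in (None, "balanced"):   # "balanced": weights proportional to 1 / class frequency
    clf = LogisticRegression(max_iter=1000, class_weight=cw).fit(X_tr, y_tr)
    score = balanced_accuracy_score(y_te, clf.predict(X_te))
    print(f"class_weight={cw!s:>9}: balanced accuracy = {score:.3f}")
```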
UFFDFR: Undersampling framework with denoising, fuzzy c-means clustering, and representative sample selection for imbalanced data classification
  • Ming Zheng, Tong Li, +5 authors Weiyi Yang · Inf. Sci., 2021
TLDR
A novel three-stage undersampling framework with denoising, fuzzy c-means clustering, and representative sample selection (UFFDFR) is proposed that improves the classification performance on imbalanced data by removing noise and unrepresentative samples from the majority class.
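
The following is a rough sketch of the general cluster-then-select undersampling idea described above, not the UFFDFR implementation itself; k-means is substituted for fuzzy c-means because scikit-learn does not provide fuzzy c-means, and the denoising stage is omitted.

```python
# Rough sketch of cluster-based undersampling: cluster the majority class and keep
# the samples closest to their cluster centres (k-means here, not fuzzy c-means,
# and without the denoising stage of UFFDFR).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification

def cluster_undersample(X, y, majority_label=0, n_clusters=10, random_state=0):
    maj = np.where(y == majority_label)[0]
    mino = np.where(y != majority_label)[0]
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state).fit(X[maj])
    # Distance of every majority sample to its own cluster centre; keep the most
    # "representative" ones until the classes are balanced.
    dist = np.linalg.norm(X[maj] - km.cluster_centers_[km.labels_], axis=1)
    keep = maj[np.argsort(dist)[:len(mino)]]
    sel = np.concatenate([keep, mino])
    return X[sel], y[sel]

X, y = make_classification(n_samples=3000, weights=[0.9], random_state=0)
X_bal, y_bal = cluster_undersample(X, y)
print("class counts before:", np.bincount(y), " after:", np.bincount(y_bal))
```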
Saliency-based Weighted Multi-label Linear Discriminant Analysis
TLDR
Saliency-based weights, obtained from various kinds of affinity-encoding prior information, are used to reveal the probability of each instance being salient for each of its classes in the multi-label problem at hand.
Predicting pulsar stars using a random tree boosting voting classifier (RTB-VC)
TLDR
This study presents a hybrid machine learning classifier, the random tree boosting voting classifier (RTB-VC), for predicting pulsar stars; it combines soft voting, hard voting, and weighted voting to obtain highly accurate and relevant criteria for classifying pulsars and non-pulsars.
Review of Factors Affecting Efficiency of Twitter Data Sentiment Analysis
TLDR
Various factors affecting the accuracy of Twitter sentiment analysis are discussed, which can be very beneficial when designing an efficient classification model for Twitter sentiment analysis.
Random forest for dissimilarity based multi-view learning: application to radiomics. (Forêt aléatoire pour l'apprentissage multi-vues basé sur la dissimilarité: Application à la Radiomique)
TLDR
The main results include a demonstration and analysis of the effectiveness of the dissimilarity measurement embedded in the Random Forest method for HDLSS multi-view learning, and a new method for measuring dissimilarities from Random Forests that is better adapted to this type of learning problem.

References

Showing 1-10 of 43 references.
An insight into classification with imbalanced data: Empirical results and current trends on using data intrinsic characteristics
TLDR
This work carries out a thorough discussion on the main issues related to using data intrinsic characteristics in this classification problem, and introduces several approaches and recommendations to address these problems in conjunction with imbalanced data.
Types of minority class examples and their influence on learning classifiers from imbalanced data
TLDR
A method is proposed for identifying four types of minority-class examples, based on analyzing the class distribution in the local neighbourhood of each considered example; it is demonstrated that the results of this analysis allow the classification performance of popular classifiers and pre-processing methods to be differentiated and their areas of competence to be evaluated.
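
A minimal sketch of this neighbourhood analysis is given below; it assumes the commonly cited convention of a 5-nearest-neighbour window and the safe/borderline/rare/outlier cut-offs, which may not match the authors' exact settings.

```python
# Sketch of labelling each minority example by its local neighbourhood: count how
# many of its k nearest neighbours share its class. The k=5 window and the
# safe/borderline/rare/outlier cut-offs are assumed conventions.
import numpy as np
from sklearn.datasets import make_moons
from sklearn.neighbors import NearestNeighbors

def minority_types(X, y, minority_label=1, k=5):
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)       # +1: a point is its own neighbour
    _, idx = nn.kneighbors(X[y == minority_label])
    same = (y[idx[:, 1:]] == minority_label).sum(axis=1)  # minority neighbours among the k
    labels = np.array(["outlier", "rare", "borderline", "borderline", "safe", "safe"])
    return labels[same]

# Example: two noisy half-moons with the second class downsampled to ~10%.
X, y = make_moons(n_samples=2000, noise=0.3, random_state=0)
keep = (y == 0) | (np.random.RandomState(0).rand(len(y)) < 0.1)
X, y = X[keep], y[keep]
types, counts = np.unique(minority_types(X, y), return_counts=True)
print(dict(zip(types, counts)))
```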
A Review on Ensembles for the Class Imbalance Problem: Bagging-, Boosting-, and Hybrid-Based Approaches
TLDR
A taxonomy of ensemble-based methods for addressing class imbalance is proposed, in which each proposal is categorized according to the inner ensemble methodology on which it is based, and a thorough empirical comparison of the most significant published approaches is carried out to show whether any of them makes a difference.
On the k-NN performance in a challenging scenario of imbalance and overlapping
TLDR
This local model is compared to other machine learning algorithms, attending to how their behaviour depends on a number of data complexity features (global imbalance, size of overlap region, and its local imbalance), and several conclusions useful for classifier design are inferred.
A study of the behavior of several methods for balancing machine learning training data
TLDR
This work performs a broad experimental evaluation involving ten methods, three of them proposed by the authors, to deal with the class imbalance problem in thirteen UCI data sets, and shows that, in general, over-sampling methods provide more accurate results than under-sampling methods considering the area under the ROC curve (AUC).
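
A minimal sketch of this kind of comparison, using imbalanced-learn's random over- and under-samplers on a synthetic data set with AUC as the criterion (not the original thirteen UCI data sets or the authors' three proposed methods):

```python
# Sketch comparing no resampling, random over-sampling, and random under-sampling
# by AUC on a synthetic data set (requires the imbalanced-learn package).
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=4000, weights=[0.95], flip_y=0.05, random_state=2)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=2)

for name, sampler in [("no resampling", None),
                      ("random over-sampling", RandomOverSampler(random_state=2)),
                      ("random under-sampling", RandomUnderSampler(random_state=2))]:
    Xr, yr = (X_tr, y_tr) if sampler is None else sampler.fit_resample(X_tr, y_tr)
    clf = LogisticRegression(max_iter=1000).fit(Xr, yr)
    print(f"{name:>21}: AUC = {roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]):.3f}")
```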
Class imbalances versus small disjuncts
TLDR
It is argued that, in order to improve classifier performance, it may be more useful to focus on the small disjuncts problem than on the class imbalance problem; experiments suggest that the problem is not directly caused by class imbalance, but rather that class imbalance may yield small disjuncts, which in turn cause the degradation.
SMOTE: Synthetic Minority Over-sampling Technique
TLDR
A combination of over-sampling the minority (abnormal) class and under-sampling the majority class can achieve better classifier performance (in ROC space); the combination is evaluated using the area under the Receiver Operating Characteristic curve (AUC) and the ROC convex hull strategy.
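
The combination described in this summary, SMOTE over-sampling of the minority class followed by random under-sampling of the majority class, can be sketched with imbalanced-learn as follows; the sampling ratios are illustrative, not the paper's.

```python
# Sketch of SMOTE over-sampling combined with random under-sampling of the majority
# class, chained in an imbalanced-learn pipeline. The sampling ratios are illustrative.
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from imblearn.under_sampling import RandomUnderSampler
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=4000, weights=[0.95], random_state=3)

model = Pipeline([
    ("smote", SMOTE(sampling_strategy=0.5, random_state=3)),               # minority up to 50% of majority
    ("under", RandomUnderSampler(sampling_strategy=0.8, random_state=3)),  # then shrink the majority
    ("clf", LogisticRegression(max_iter=1000)),
])
auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
print(f"AUC with SMOTE + random under-sampling: {auc:.3f}")
```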
Learning from Imbalanced Data
TLDR
A critical review of the nature of the problem, the state-of-the-art technologies, and the current assessment metrics used to evaluate learning performance under the imbalanced learning scenario is provided.
Using Class Imbalance Learning for Software Defect Prediction
TLDR
This paper investigates different types of class imbalance learning methods, including resampling techniques, threshold moving, and ensemble algorithms, and concludes that AdaBoost.NC shows the best overall performance in terms of measures including balance, G-mean, and Area Under the Curve (AUC).
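
Among the method families listed here, threshold moving is the simplest to illustrate: keep the trained model unchanged and shift the decision threshold towards the minority class. The sketch below uses an illustrative prior-based threshold, not the paper's configuration.

```python
# Minimal sketch of threshold moving: leave the trained model unchanged and lower
# the decision threshold for the minority class (here to the minority prior).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, weights=[0.93], random_state=4)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=4)

clf = RandomForestClassifier(n_estimators=200, random_state=4).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]

for thr in (0.5, y_tr.mean()):          # default threshold vs. the minority-class prior
    print(f"threshold = {thr:.2f}: minority F1 = {f1_score(y_te, proba >= thr):.3f}")
```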
Class Imbalances versus Class Overlapping: An Analysis of a Learning System Behavior
TLDR
This work develops a systematic study aiming to question whether class imbalances are truly to blame for the loss of performance of learning systems, or whether class imbalances are not a problem by themselves.