Nitesh V. Chawla

An approach to the construction of classifiers from imbalanced datasets is described. A dataset is imbalanced if the classification categories are not approximately equally represented. Often real-world data sets are predominantly composed of “normal” examples with only a small percentage of “abnormal” or “interesting” examples. It is also the case that the cost of misclassifying an abnormal (interesting) example as a normal example is often much higher than the cost of the reverse error. …
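The SMOTE technique this abstract introduces over-samples the minority class by interpolating between a minority example and one of its nearest minority-class neighbors. Below is a minimal NumPy sketch of that idea, not the paper's reference implementation; the function name, the neighbor count k, and the assumption that the minority class holds more than k examples are illustrative choices.

```python
import numpy as np

def smote(X_minority, n_synthetic, k=5, rng=None):
    """Generate n_synthetic points by interpolating each chosen minority
    example with one of its k nearest minority-class neighbors."""
    rng = np.random.default_rng(rng)
    n = len(X_minority)  # assumes n > k
    # Pairwise Euclidean distances within the minority class.
    d = np.linalg.norm(X_minority[:, None, :] - X_minority[None, :, :], axis=-1)
    # For each point, indices of its k nearest neighbors (index 0 is itself).
    neighbors = np.argsort(d, axis=1)[:, 1:k + 1]
    samples = []
    for _ in range(n_synthetic):
        i = rng.integers(n)                # pick a minority example
        j = neighbors[i, rng.integers(k)]  # pick one of its neighbors
        gap = rng.random()                 # interpolation factor in [0, 1]
        samples.append(X_minority[i] + gap * (X_minority[j] - X_minority[i]))
    return np.array(samples)
```

For instance, smote(X_min, 200) would turn a 100-example minority class into 300 examples by adding 200 interpolated points along neighbor-to-neighbor line segments.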
Many real-world data mining applications involve learning from imbalanced data sets. Learning from data sets that contain very few instances of the minority (or interesting) class usually produces biased classifiers that have a higher predictive accuracy over the majority class(es), but poorer predictive accuracy over the minority class. SMOTE (Synthetic Minority Over-sampling Technique) …
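This abstract describes SMOTEBoost, which injects fresh synthetic minority examples into each boosting round. The sketch below is a deliberately simplified version, using a discrete AdaBoost-style weight update rather than the AdaBoost.M2 formulation the paper builds on; it reuses the smote() sketch above and assumes NumPy arrays with binary labels {0, 1}, where 1 is the minority class.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def smoteboost_fit(X, y, rounds=10, n_synthetic=100, seed=0):
    """Simplified SMOTEBoost loop; assumes y contains labels {0, 1}."""
    rng = np.random.default_rng(seed)
    w = np.full(len(y), 1.0 / len(y))  # example weights on the real data
    learners, alphas = [], []
    for _ in range(rounds):
        # Fresh synthetic minority examples each round (smote() from above).
        X_syn = smote(X[y == 1], n_synthetic, rng=rng)
        X_aug = np.vstack([X, X_syn])
        y_aug = np.concatenate([y, np.ones(len(X_syn), dtype=int)])
        # Synthetic points receive the current average weight.
        w_aug = np.concatenate([w, np.full(len(X_syn), w.mean())])
        weak = DecisionTreeClassifier(max_depth=2)
        weak.fit(X_aug, y_aug, sample_weight=w_aug)
        pred = weak.predict(X)
        err = w[pred != y].sum()  # weighted error on the real data (w sums to 1)
        if err == 0 or err >= 0.5:
            break
        alpha = 0.5 * np.log((1 - err) / err)  # discrete AdaBoost step size
        w *= np.exp(alpha * (pred != y))       # up-weight misclassified examples
        w /= w.sum()
        learners.append(weak)
        alphas.append(alpha)
    return learners, alphas

def smoteboost_predict(learners, alphas, X):
    # Weighted vote; maps {0, 1} predictions to {-1, +1} before summing.
    score = sum(a * (2 * m.predict(X) - 1) for m, a in zip(learners, alphas))
    return (score > 0).astype(int)
```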
A dataset is imbalanced if the classification categories are not approximately equally represented. Recent years have brought increased interest in applying machine learning techniques to difficult "real-world" problems, many of which are characterized by imbalanced data. Additionally, the distribution of the testing data may differ from that of the training data. …
Traditional classification algorithms can be limited in their performance on highly unbalanced data sets. A popular stream of work for countering the problem of class imbalance has been the application of a variety of sampling strategies. In this paper, we focus on designing modifications to support vector machines (SVMs) to appropriately tackle the problem of class imbalance. …
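One standard way to make an SVM cost-sensitive, sketched here with scikit-learn's class_weight option, is to penalize errors on the rare class more heavily. This illustrates the general idea of imbalance-aware SVMs, not the specific modifications proposed in the paper; X_train and y_train are assumed to exist.

```python
from sklearn.svm import SVC

# "balanced" scales each class's error penalty C inversely to its frequency,
# so mistakes on the rare class cost more than mistakes on the majority class.
clf = SVC(kernel="rbf", class_weight="balanced")
# clf.fit(X_train, y_train)
```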
Link prediction, i.e., predicting links or interactions between objects in a network, is an important task in network analysis. Although the problem has attracted much attention recently, there are several challenges that have not been addressed so far. First, most existing studies focus only on link prediction in homogeneous networks, where all objects and links are of the same type. …
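For contrast with the heterogeneous setting, here is a minimal homogeneous-network baseline: scoring each unlinked pair of nodes by its number of common neighbors. The paper's point is precisely that heterogeneous networks need richer, type-aware features (e.g., meta-path-based ones) beyond such scores; the function below is an illustrative sketch.

```python
from collections import defaultdict

def common_neighbors_scores(edges):
    """Score every non-adjacent node pair by the number of shared neighbors."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    scores = {}
    nodes = list(adj)
    for i, u in enumerate(nodes):
        for v in nodes[i + 1:]:
            if v not in adj[u]:  # only score pairs that are not yet linked
                scores[(u, v)] = len(adj[u] & adj[v])
    return scores

# Example: in the path graph a-b-c, the unlinked pair (a, c) shares neighbor b.
print(common_neighbors_scores([("a", "b"), ("b", "c")]))  # {('a', 'c'): 1}
```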
Learning from imbalanced data sets presents a convoluted problem from both the modeling and cost standpoints. In particular, when a class is of great interest but occurs relatively rarely, such as in cases of fraud, instances of disease, and regions of interest in large-scale simulations, there is a correspondingly high cost for the misclassification of rare events. …
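To illustrate the cost standpoint: given posterior estimates from any classifier, a cost-sensitive decision rule predicts the class with minimum expected cost rather than the most probable class. The cost values below are made up for illustration, not taken from the paper.

```python
import numpy as np

# cost[i, j] = cost of predicting class j when the true class is i.
cost = np.array([[0.0, 1.0],     # false alarm on the majority class: cheap
                 [10.0, 0.0]])   # missed rare event: expensive

def min_cost_decision(proba, cost):
    """proba: (n_samples, n_classes) posterior estimates from any classifier."""
    expected_cost = proba @ cost  # expected cost of each possible prediction
    return expected_cost.argmin(axis=1)

proba = np.array([[0.9, 0.1]])          # only a 10% chance of the rare class
print(min_cost_decision(proba, cost))   # [1]: the cost matrix tips the decision
```

Here predicting the majority class has expected cost 0.1 × 10 = 1.0, while predicting the rare class costs 0.9 × 1 = 0.9, so the rare class is chosen despite its low probability.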
Imbalanced data sets are becoming ubiquitous, as many applications have very few instances of the “interesting” or “abnormal” class. Traditional machine learning algorithms can be biased towards the majority class due to its over-prevalence. It is desired that the interesting (minority) class prediction be improved, even if at the cost of additional majority class errors. …
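One simple way to realize this trade-off, sketched below with random under-sampling of the majority class (all names and parameters are illustrative): discarding majority examples raises minority recall at the price of extra majority-class errors.

```python
import numpy as np

def undersample_majority(X, y, majority_label, ratio=1.0, seed=0):
    """Keep all minority examples plus ratio * n_minority majority examples."""
    rng = np.random.default_rng(seed)
    maj = np.flatnonzero(y == majority_label)
    mino = np.flatnonzero(y != majority_label)
    n_keep = min(len(maj), int(ratio * len(mino)))
    keep = rng.choice(maj, size=n_keep, replace=False)
    idx = np.concatenate([mino, keep])
    return X[idx], y[idx]
```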
Bagging and boosting are two popular ensemble methods that typically achieve better accuracy than a single classifier. These techniques have limitations on massive datasets, as the size of the dataset can be a bottleneck. Voting many classifiers built on small subsets of data (“pasting small votes”) is a promising approach for learning from massive datasets. …
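A minimal sketch of the pasting idea, with illustrative subset sizes: train many classifiers, each on a small random subset ("bite") of the data, and combine them by unweighted majority vote. Integer class labels are assumed for the vote.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def paste_small_votes(X, y, n_models=50, bite_size=1000, seed=0):
    """Fit n_models trees, each on a small random subset of (X, y)."""
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(n_models):
        idx = rng.choice(len(X), size=min(bite_size, len(X)), replace=False)
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def vote(models, X):
    # Stack per-model predictions: shape (n_models, n_samples).
    preds = np.stack([m.predict(X) for m in models])
    # Majority vote per sample; assumes small non-negative integer labels.
    return np.apply_along_axis(lambda c: np.bincount(c).argmax(), 0, preds)
```

In scikit-learn, a similar scheme can be had with sklearn.ensemble.BaggingClassifier using bootstrap=False and a small max_samples, which is sampling without replacement, i.e., pasting rather than bagging.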