Fast-CBUS: A fast clustering-based undersampling method for addressing the class imbalance problem
Online auction sites are a target for fraud due to their anonymity, number of potential targets, and low likelihood of identification. Researchers have developed methods for identifying fraud. However, these methods must be individually tailored to each type of fraud, since each type differs in the characteristics important for its identification. Using supervised learning methods, it is possible to produce classifiers for specific types of fraud by providing a dataset in which instances with behaviours of interest are assigned to a separate class. However, this requires multiple labelled datasets: one for each fraud type of interest. It is difficult to use real-world datasets for this purpose since they are difficult to label, often limited in size, and contain zero or multiple suspicious behaviours that may or may not be under investigation. The aims of this work are to: (1) demonstrate the approach of using supervised learning together with a validated synthetic data generator to create fraud detection models that are experimentally more accurate than existing methods and that are effective over real data, and (2) evaluate a set of features for use in general fraud detection, which is shown to further improve the performance of the created detection models. The approach is as follows. The data generator is an agent-based simulation modelled on users in commercial online auction data. The simulation is extended with fraud agents that model a known type of online auction fraud called competitive shilling. These agents are added to the simulation to produce the synthetic datasets. Features extracted from this data are used as training data for supervised learning. Using this approach, we optimise an existing fraud detection algorithm and produce classifiers capable of detecting shilling fraud. Experimental results with synthetic data show that the new models have significant improvements in detection accuracy.
Results with commercial data show that the models identify users with suspicious behaviour. © 2013 Elsevier Ltd. All rights reserved.