A feature selection Bayesian approach for a clustering genetic algorithm

Abstract

Feature selection is an important task in clustering problems. Some features help to find useful clusters, whereas others may hinder the clustering process, so a well-chosen feature subset can yield better clusters. Feature selection also reduces the dimensionality of the dataset, which improves the efficiency of the clustering method. This work describes a Bayesian feature selection approach for a Clustering Genetic Algorithm (CGA). The general method consists of four steps: (i) apply the CGA to a sample of objects selected from the complete dataset; (ii) treat the obtained clusters as different classes, which can be modeled by Bayesian networks; (iii) generate a Bayesian network and use the Markov blanket of the class variable for the feature subset selection task; (iv) apply the CGA to the complete dataset, now formed only by the selected features. Initially, we are mainly interested in evaluating whether the feature selection process makes sense in the context of the CGA, which can find the best clustering in a dataset according to the Average Silhouette Width criterion. Our first investigation therefore assumes an ideal situation in which the CGA has actually found the right clustering in step (i), so the Bayesian networks are generated not from a sample but from the complete dataset, correctly clustered/classified. In this way we can better evaluate whether the proposed hybrid method is appropriate, i.e. whether the features selected by means of Bayesian networks are suitable for the CGA. To this end, we performed simulations on three datasets that are benchmarks for data mining methods: Wisconsin Breast Cancer, Mushroom and Congressional Voting Records. The simulations on the datasets formed by the selected features provided better results than those on the complete datasets. Thus, we believe that the proposed method is very promising.
Transactions on Information and Communications Technologies vol 29, © 2003 WIT Press, www.witpress.com, ISSN 1743-3517
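
Step (iii) of the method described in the abstract selects features through the Markov blanket of the class variable, i.e. its parents, its children, and the other parents of its children in the Bayesian network. The following is a minimal sketch of that selection rule only, not the authors' implementation: it assumes the network structure has already been learned and is given as a mapping from each node to its set of parents. The function name `markov_blanket`, the node names `cluster` and `f1` to `f5`, and the example structure are all hypothetical.

```python
from typing import Dict, Set


def markov_blanket(parents: Dict[str, Set[str]], target: str) -> Set[str]:
    """Return the Markov blanket of `target` in a DAG.

    `parents` maps every node to the set of its parent nodes
    (an assumed representation, chosen for illustration).
    """
    # Children of the target: nodes that list it among their parents.
    children = {node for node, ps in parents.items() if target in ps}

    # Spouses: the other parents of the target's children.
    spouses: Set[str] = set()
    for child in children:
        spouses |= parents[child]

    blanket = set(parents.get(target, set())) | children | spouses
    blanket.discard(target)  # the target itself is not part of its blanket
    return blanket


# Hypothetical network: "cluster" plays the role of the class variable
# obtained from the clustering; f1..f5 are candidate features.
example_parents: Dict[str, Set[str]] = {
    "cluster": {"f1"},        # f1 is a parent of the class variable
    "f1": set(),
    "f2": {"cluster", "f3"},  # f2 is a child; f3 is a co-parent (spouse)
    "f3": set(),
    "f4": {"f2"},             # grandchild: outside the blanket
    "f5": set(),              # isolated: outside the blanket
}

selected_features = sorted(markov_blanket(example_parents, "cluster"))
print(selected_features)  # ['f1', 'f2', 'f3']
```

In this toy structure only `f1`, `f2` and `f3` would be passed on to the clustering in step (iv); `f4` and `f5` are discarded because, given the blanket, they carry no further information about the class variable.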

Cite this paper

@inproceedings{Hruschka2003AFS, title={A feature selection Bayesian approach for a clustering genetic algorithm}, author={Estevam R. Hruschka}, year={2003} }