Document Clustering Using K-Means, Heuristic K-Means and Fuzzy C-Means

Abstract

Document clustering is the unsupervised classification (categorization) of documents into groups (clusters) such that documents within a cluster are similar, whereas documents in different clusters are dissimilar. The documents may be web pages, blog posts, news articles, or other text files. This paper presents our experimental work on applying the K-means, heuristic K-means and fuzzy C-means algorithms to the clustering of text documents. We experimented with different representations (tf, tf.idf and Boolean) and different feature selection schemes (with or without stop word removal, and with or without stemming). We ran our implementations on some standard datasets and computed various performance measures for these algorithms. The results indicate that the tf.idf representation and the use of stemming yield better clustering. Moreover, fuzzy clustering produces better results than both K-means and heuristic K-means on almost all datasets, and is a more stable method.
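To make the three representations and the difference between hard and fuzzy assignment concrete, the following is a minimal Python sketch, not the authors' implementation. The function names (`vectorize`, `hard_assignment`, `fuzzy_memberships`) are hypothetical. The tf.idf weighting shown (raw term frequency times log(N/df)) and the fuzzifier value m = 2 are assumptions, since the abstract does not state which variants were used. Heuristic K-means is not sketched because the abstract does not describe it.

```python
import math
from collections import Counter


def vectorize(docs, scheme="tfidf"):
    """Turn tokenized documents into term vectors over a shared vocabulary."""
    vocab = sorted({term for doc in docs for term in doc})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        if scheme == "tf":
            vec = [float(tf[t]) for t in vocab]
        elif scheme == "boolean":
            vec = [1.0 if tf[t] else 0.0 for t in vocab]
        elif scheme == "tfidf":
            # Assumed variant: raw tf * log(N / df).
            vec = [tf[t] * math.log(n_docs / df[t]) for t in vocab]
        else:
            raise ValueError(f"unknown scheme: {scheme}")
        vectors.append(vec)
    return vocab, vectors


def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def hard_assignment(point, centers):
    """K-means style assignment: index of the single nearest center."""
    return min(range(len(centers)), key=lambda j: distance(point, centers[j]))


def fuzzy_memberships(point, centers, m=2.0):
    """Fuzzy C-means style assignment: one membership degree per center.

    Uses the standard update u_j = 1 / sum_k (d_j / d_k)^(2 / (m - 1));
    m = 2.0 is an assumed fuzzifier, not a value taken from the paper.
    """
    dists = [distance(point, c) for c in centers]
    # A point that coincides with a center belongs fully to that center.
    if any(d == 0.0 for d in dists):
        hits = [1.0 if d == 0.0 else 0.0 for d in dists]
        return [h / sum(hits) for h in hits]
    power = 2.0 / (m - 1.0)
    return [1.0 / sum((dj / dk) ** power for dk in dists) for dj in dists]


# Toy corpus, already tokenized (stop word removal and stemming are omitted).
docs = [
    ["cluster", "document", "cluster"],
    ["document", "news"],
    ["news", "blog", "news"],
]
vocab, tf_vectors = vectorize(docs, "tf")
_, bool_vectors = vectorize(docs, "boolean")
_, tfidf_vectors = vectorize(docs, "tfidf")
```

In this sketch, K-means commits each document to exactly one cluster, whereas fuzzy C-means gives it a degree of membership in every cluster, with the degrees summing to 1. A full clustering run would alternate these assignment steps with a recomputation of the cluster centers, which is omitted here.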

Cite this paper

@article{Singh2011DocumentCU, title={Document Clustering Using K-Means, Heuristic K-Means and Fuzzy C-Means}, author={Vivek Kumar Singh and Nisha Tiwari and Shekhar Garg}, journal={2011 International Conference on Computational Intelligence and Communication Networks}, year={2011}, pages={297-301} }