Hierarchical Document Clustering (Encyclopedia of Data Warehousing and Mining, Volume I, A-H, Idea Group Reference)

Abstract

INTRODUCTION

Document clustering is the automatic grouping of text documents into clusters so that documents within a cluster are highly similar to one another but dissimilar to documents in other clusters. Unlike document classification (Wang, Zhou, & He, 2001), clustering is given no labeled documents; hence, it is also known as unsupervised learning. Hierarchical document clustering organizes clusters into a tree, or hierarchy, that facilitates browsing. The parent-child relationship among the nodes in the tree can be viewed as a topic-subtopic relationship in a subject hierarchy such as the Yahoo! directory. This chapter discusses several special challenges in hierarchical document clustering: high dimensionality, high volume of data, ease of browsing, and meaningful cluster labels. It reviews state-of-the-art document clustering algorithms: the partitioning method (Steinbach, Karypis, & Kumar, 2000), agglomerative and divisive hierarchical clustering (Kaufman & Rousseeuw, 1990), and frequent itemset-based hierarchical clustering (Fung, Wang, & Ester, 2003). The last method, which was recently developed by the authors, is elaborated further because it was designed specifically to address the hierarchical document clustering problem.
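To make the idea of a cluster tree concrete, the following is a minimal sketch of agglomerative (bottom-up) hierarchical clustering over plain term-frequency vectors. It is an illustration only, not the frequent itemset-based method of Fung, Wang, and Ester (2003). The function names `cosine` and `agglomerative`, the whitespace tokenizer, the cosine similarity measure, and the choice to represent a merged cluster by the sum of its members' term counts are all assumptions made for this example. It also assumes a non-empty list of documents.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def agglomerative(docs):
    """Build a cluster tree by repeatedly merging the two most similar clusters.

    Leaves are document indices; each internal node is a 2-tuple of subtrees.
    Assumes `docs` is a non-empty list of strings.
    """
    # Each cluster is (tree, summed term-frequency vector of its documents).
    clusters = [(i, Counter(text.lower().split())) for i, text in enumerate(docs)]
    while len(clusters) > 1:
        best = None  # (similarity, index_i, index_j) of the closest pair so far
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sim = cosine(clusters[i][1], clusters[j][1])
                if best is None or sim > best[0]:
                    best = (sim, i, j)
        _, i, j = best
        # The merged node keeps both subtrees and the combined term counts.
        merged = ((clusters[i][0], clusters[j][0]), clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0]
```

Called on the four toy documents `["apple banana fruit", "banana fruit salad", "python code bug", "code bug fix"]`, the function returns the tree `((0, 1), (2, 3))`: the two fruit documents form one subtree and the two programming documents form the other. The all-pairs search at every merge step is deliberately naive, and its cost grows quickly with collection size, which hints at why high data volume is listed among the challenges above.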


Cite this paper

@inproceedings{Roth2006EncyclopediaOD, title={Encyclopedia of Data Warehousing and Mining Volume I A-h Idea Group Reference 0 Hierarchical Document Clustering}, author={Kristin M. Roth and Jennifer Neidig and Eva Brennan and Alana Bubnis and R. S. M. Davies and Sue VanderHook and Diane Huskinson and S. Reed and Larissa Zearfoss and Michelle Potter and Benjamin C. M. Fung and Ke Wang}, year={2006} }