Comparison of Agglomerative and Partitional Document Clustering Algorithms

@inproceedings{Zhao2002ComparisonOA,
  title={Comparison of Agglomerative and Partitional Document Clustering Algorithms},
  author={Ying Zhao and George Karypis},
  year={2002}
}
Abstract: Fast and high-quality document clustering algorithms play an important role in providing intuitive navigation and browsing mechanisms by organizing large amounts of information into a small number of meaningful clusters, and in greatly improving retrieval performance via cluster-driven dimensionality reduction, term weighting, or query expansion. Key Result: Our experimental evaluation shows that for every criterion function, partitional algorithms always lead to better clustering…
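The head-to-head setup the paper describes can be illustrated with an off-the-shelf toolkit. The sketch below is an assumption on my part: it uses scikit-learn stand-ins (k-means and sklearn's default agglomerative linkage) rather than the authors' own criterion-driven implementations and I/E criterion functions, but it shows the same partitional-versus-agglomerative comparison over one tf-idf matrix.

```python
# Illustrative only: scikit-learn stand-ins for the partitional vs. agglomerative
# comparison; the paper itself uses its own criterion-driven clustering code.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans, AgglomerativeClustering

docs = [
    "stock market falls on rate fears",
    "central bank raises interest rates",
    "team wins championship final",
    "star striker scores twice in final",
]

# Unit-length tf-idf vectors, the document model used throughout this literature.
X = TfidfVectorizer().fit_transform(docs)

# Partitional: k-means directly optimizes a global criterion over k clusters.
part_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Agglomerative: start from singleton clusters and repeatedly merge the closest pair.
# (Dense input required; Ward/Euclidean is sklearn's default linkage, whereas the
# paper studies UPGMA and criterion-driven merging on cosine similarity.)
agg_labels = AgglomerativeClustering(n_clusters=2).fit_predict(X.toarray())

print("partitional:", part_labels)
print("agglomerative:", agg_labels)
```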

Citations

COMPARISON OF HIERARCHICAL AGGLOMERATIVE ALGORITHMS FOR CLUSTERING MEDICAL DOCUMENTS
TLDR
The experimental results showed that the agglomerative algorithm that uses I1 as its criterion function for choosing which clusters to merge produced better cluster quality than the other criterion functions in terms of entropy and purity as external measures.
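Entropy and purity, the external measures cited above, can be computed from a cluster-by-class contingency. A minimal sketch follows, assuming the standard weighted-average definitions used in Zhao and Karypis's evaluations (per-cluster class proportions, entropy normalized by log of the number of classes).

```python
import numpy as np

def entropy_and_purity(cluster_labels, class_labels):
    """External clustering quality: lower entropy and higher purity are better."""
    cluster_labels = np.asarray(cluster_labels)
    class_labels = np.asarray(class_labels)
    classes = np.unique(class_labels)
    n, q = len(class_labels), len(classes)
    total_entropy, total_purity = 0.0, 0.0
    for r in np.unique(cluster_labels):
        members = class_labels[cluster_labels == r]
        n_r = len(members)
        # Class distribution inside cluster r.
        p = np.array([(members == c).sum() for c in classes]) / n_r
        p_nonzero = p[p > 0]
        # Cluster entropy, normalized by log(q) so it lies in [0, 1].
        e_r = -(p_nonzero * np.log(p_nonzero)).sum() / np.log(q)
        total_entropy += (n_r / n) * e_r
        # Cluster purity: fraction belonging to the dominant class.
        total_purity += (n_r / n) * p.max()
    return total_entropy, total_purity

# Example: 6 documents, 2 clusters, 2 true classes.
ent, pur = entropy_and_purity([0, 0, 0, 1, 1, 1], ["a", "a", "b", "b", "b", "a"])
print(f"entropy={ent:.3f} purity={pur:.3f}")
```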
Multilevel k-way Document Clustering: Experiments & Analysis
TLDR
This paper proposes a multi-level optimization technique in the context of document clustering and experimentally shows that multilevel optimizers produce high-quality clusters, take much less time, and are scalable to large datasets.
A Comparison of Two Document Clustering Approaches for Clustering Medical Documents
TLDR
The main goals of this paper are to experimentally evaluate the performance of six criterion functions in the context of the partitional clustering approach, and to identify the clustering algorithm that produces high-quality clusters of real-world medical documents, so that hidden knowledge can be discovered by analyzing the resulting clusters.
On the Performance of Feature Weighting K-Means for Text Subspace Clustering
TLDR
A performance study of a new subspace clustering algorithm for large sparse text data that automatically calculates feature weights during the k-means clustering process, converges quickly to a local optimal solution, and is scalable in the number of documents, terms, and clusters.
Evaluation of Partitional Algorithms for Clustering Medical Documents
TLDR
The experimental results show that E1 leads to the best solution when repeated bisection is used as the clustering method, in terms of entropy, while I1 is the best with direct clustering, in terms of both entropy and purity.
Improving Clustering Methods By Exploiting Richness Of Text Data
TLDR
The new text clustering methods introduced in this thesis can be widely applied in domains that involve analysis of text data. The results highlight that exploiting user queries improves Search Result Clustering (SRC), that utilizing rich features in weighting schemes and distance measures improves soft subspace clustering, and that utilizing multiple views and a multi-objective cluster-oriented method improves clustering ensemble methods.
A multilevel K-Means algorithm for the clustering problem
TLDR
This work suggests viewing the clustering problem as a hierarchical optimization process that moves through different levels, evolving from a coarse-grained to a fine-grained strategy, and obtains a better clustering of the original problem by refining the intermediate clusterings with the popular K-Means algorithm.
The Parameter-less Randomized Gravitational Clustering algorithm with online clusters’ structure characterization
TLDR
This paper presents a data clustering algorithm that does not require a parameter-setting process, the Parameter-less Randomized Gravitational Clustering algorithm (Pl-Rgc), and combines it with a mechanism, based on micro-cluster ideas, for representing a cluster as a set of prototypes.
An Entropy Weighting k-Means Algorithm for Subspace Clustering of High-Dimensional Sparse Data
TLDR
This paper presents a new k-means-type algorithm for clustering high-dimensional objects in subspaces that can generate better clustering results than other subspace clustering algorithms and is also scalable to large data sets.
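The entropy-weighted scheme summarized above alternates the usual k-means steps with a per-cluster feature-weight update, where a dimension's weight decays exponentially with its within-cluster dispersion. A rough sketch follows; it is my own paraphrase of the general EWKM-style update, with gamma as an assumed entropy-regularization parameter, and omits the safeguards described in the published algorithm.

```python
import numpy as np

def entropy_weighted_kmeans(X, k, gamma=1.0, iters=20, seed=0):
    """k-means with per-cluster feature weights (soft subspace clustering sketch)."""
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    rng = np.random.RandomState(seed)
    centers = X[rng.choice(n, k, replace=False)]
    weights = np.full((k, d), 1.0 / d)          # start with uniform weights
    labels = np.zeros(n, dtype=int)
    for _ in range(iters):
        # Assignment step: weighted squared distance to each center.
        dist = np.stack([((X - centers[l]) ** 2 * weights[l]).sum(axis=1)
                         for l in range(k)], axis=1)
        labels = dist.argmin(axis=1)
        for l in range(k):
            members = X[labels == l]
            if len(members) == 0:
                continue
            centers[l] = members.mean(axis=0)
            # Per-dimension dispersion inside cluster l.
            D = ((members - centers[l]) ** 2).sum(axis=0)
            # Entropy-regularized update: tighter dimensions receive more weight.
            w = np.exp(-D / gamma)
            weights[l] = w / w.sum()
    return labels, weights

# Two clusters separated along the first (low-variance) dimension.
X = np.vstack([np.random.RandomState(2).normal(m, [0.05, 1.0], size=(30, 2))
               for m in ([0, 0], [1, 0])])
labels, weights = entropy_weighted_kmeans(X, 2, gamma=0.5)
print(weights.round(3))
```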
High-Order Co-clustering Text Data on Semantics-Based Representation Model
TLDR
Experimental results on benchmark data sets have shown that the proposed co-clustering on the high-order, semantics-based structure outperforms the general co-clustering algorithm on bipartite text data, such as document-term, document-concept, and document-(term+concept).
...
...

References

SHOWING 1-10 OF 58 REFERENCES
Criterion Functions for Document Clustering: Experiments and Analysis
TLDR
This study involves a total of seven different criterion functions, three of which are introduced in this paper and four that have been proposed in the past, and involves both a comprehensive experimental evaluation and an analysis of the characteristics of the various criterion functions and their effect on the clusters they produce.
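Three of the criterion functions studied in this line of work have compact vector forms over unit-length document vectors: I1 and I2 are maximized, E1 is minimized. The sketch below is my own restatement, assuming the composite-vector formulations given in the authors' criterion-function papers (D_r is the sum of the document vectors in cluster r, D the composite of the whole collection), and simply evaluates the three functions for a fixed clustering.

```python
import numpy as np

def criterion_values(X, labels):
    """I1, I2 (to maximize) and E1 (to minimize) for unit-length rows of X."""
    X = np.asarray(X, dtype=float)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)   # ensure unit length
    D = X.sum(axis=0)                                  # collection composite vector
    I1 = I2 = E1 = 0.0
    for r in np.unique(labels):
        members = X[np.asarray(labels) == r]
        n_r = len(members)
        D_r = members.sum(axis=0)                      # cluster composite vector
        I1 += (D_r @ D_r) / n_r            # average pairwise similarity in cluster
        I2 += np.linalg.norm(D_r)          # similarity of documents to centroid
        E1 += n_r * (D_r @ D) / (np.linalg.norm(D_r) * np.linalg.norm(D))
    return I1, I2, E1

# Toy example: 4 documents in a 3-term space, two clusters.
X = [[1, 0, 0], [1, 1, 0], [0, 0, 1], [0, 1, 1]]
print(criterion_values(X, [0, 0, 1, 1]))
```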
Recent trends in hierarchic document clustering: A critical review
P. Willett, Inf. Process. Manag., 1988
Fast and effective text mining using linear-time document clustering
TLDR
An unsupervised, near-linear time text clustering system that offers a number of algorithm choices for each phase, and a refinement to center adjustment, “vector average damping,” that further improves cluster quality.
A Comparison of Document Clustering Techniques
TLDR
This paper compares the two main approaches to document clustering, agglomerative hierarchical clustering and K-means, and indicates that the bisecting K-means technique is better than the standard K-means approach and as good as or better than the hierarchical approaches that were tested, for a variety of cluster evaluation metrics.
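Bisecting K-means grows a partition by repeatedly splitting one cluster with 2-means. The sketch below is a simplification I am assuming for illustration: it always bisects the largest cluster and keeps the best of a few 2-means trials, whereas the cited study also considers other cluster-selection rules.

```python
import numpy as np
from sklearn.cluster import KMeans

def bisecting_kmeans(X, k, trials=5, random_state=0):
    """Split the largest cluster with 2-means until k clusters exist."""
    X = np.asarray(X, dtype=float)
    labels = np.zeros(len(X), dtype=int)
    rng = np.random.RandomState(random_state)
    while labels.max() + 1 < k:
        # Pick the largest cluster to bisect.
        target = np.bincount(labels).argmax()
        idx = np.where(labels == target)[0]
        # Run 2-means a few times and keep the split with the lowest inertia.
        best = None
        for _ in range(trials):
            km = KMeans(n_clusters=2, n_init=1,
                        random_state=rng.randint(1 << 30)).fit(X[idx])
            if best is None or km.inertia_ < best.inertia_:
                best = km
        # Points assigned to the second half get a brand-new cluster id.
        labels[idx[best.labels_ == 1]] = labels.max() + 1
    return labels

# Three well-separated blobs; expect three clusters of 20 points each.
X = np.vstack([np.random.RandomState(1).normal(c, 0.1, size=(20, 2))
               for c in (0.0, 1.0, 2.0)])
print(np.bincount(bisecting_kmeans(X, 3)))
```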
Chameleon: Hierarchical Clustering Using Dynamic Modeling
TLDR
Chameleon's key feature is that it accounts for both interconnectivity and closeness in identifying the most similar pair of clusters, which is important for dealing with highly variable clusters.
CHAMELEON: A Hierarchical Clustering Algorithm Using Dynamic Modeling
TLDR
A novel hierarchical clustering algorithm called CHAMELEON that measures the similarity of two clusters based on a dynamic model and can discover natural clusters that many existing state-of-the-art clustering algorithms fail to find.
CURE: an efficient clustering algorithm for large databases
TLDR
This work proposes a new clustering algorithm called CURE that is more robust to outliers and identifies clusters having non-spherical shapes and wide variances in size, and demonstrates that random sampling and partitioning enable CURE not only to outperform existing algorithms but also to scale well to large databases without sacrificing clustering quality.
Scatter/Gather: a cluster-based approach to browsing large document collections
TLDR
This work presents a document browsing technique that employs document clustering as its primary operation, and presents fast (linear time) clustering algorithms which support this interactive browsing paradigm.
Concept Indexing: A Fast Dimensionality Reduction Algorithm With Applications to Document Retrieval and Categorization
TLDR
Experimental results show that the dimensionality reduction computed by CI achieves comparable retrieval performance to that obtained using LSI, while requiring an order of magnitude less time.
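Concept Indexing reduces dimensionality by clustering the collection and using the cluster centroids as the axes of the reduced space. The sketch below is a simplified rendering I am assuming for illustration: each document is re-expressed by its dot products with the unit-length centroid ("concept") vectors, which captures the spirit of CI without the refinements described in the reference.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "interest rates rise again",
    "the market reacts to rate hikes",
    "the striker scored the winning goal",
    "a late goal decided the match",
]
X = TfidfVectorizer().fit_transform(docs).toarray()

k = 2  # reduced dimensionality = number of clusters
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)

# Unit-length centroid ("concept") vectors become the axes of the reduced space.
C = km.cluster_centers_
C = C / np.linalg.norm(C, axis=1, keepdims=True)

# Each document is re-expressed by its similarity to every concept vector.
X_reduced = X @ C.T
print(X_reduced.round(2))
```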
Data clustering: a review
TLDR
An overview of pattern clustering methods from a statistical pattern recognition perspective is presented, with a goal of providing useful advice and references to fundamental concepts accessible to the broad community of clustering practitioners.
...
...