A comparison of two suffix tree-based document clustering algorithms

  • Muhammad Rafi, M. Maujood, M. M. Fazal, S. M. Ali
  • Published 14 June 2010
  • Computer Science
  • 2010 International Conference on Information and Emerging Technologies
Document clustering, as an unsupervised approach, is extensively used to navigate, filter, summarize and manage large collections of documents such as the World Wide Web (WWW). Recently, the focus in this domain has shifted from traditional vector-based document similarity to suffix tree-based document similarity, as it offers a more semantic representation of the text present in the document. In this paper, we compare and contrast two recently introduced approaches to document…
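
The contrast the abstract draws between vector-based and phrase-based similarity can be illustrated with a small sketch. This is not the method of either compared algorithm: it uses sets of word n-grams as a simple stand-in for the shared phrases a suffix tree would index, and the function names and the `max_len` parameter are assumptions made for the example.

```python
from collections import Counter
from math import sqrt


def cosine_bow(a, b):
    """Cosine similarity of bag-of-words term counts (word order is ignored)."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    norm = sqrt(sum(v * v for v in ca.values())) * sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0


def phrases(text, max_len=3):
    """All word n-grams of length 1..max_len (a stand-in for suffix tree phrases)."""
    words = text.lower().split()
    return {
        tuple(words[i:i + n])
        for n in range(1, max_len + 1)
        for i in range(len(words) - n + 1)
    }


def phrase_overlap(a, b, max_len=3):
    """Jaccard overlap of the two phrase sets (word order matters)."""
    pa, pb = phrases(a, max_len), phrases(b, max_len)
    union = pa | pb
    return len(pa & pb) / len(union) if union else 0.0
```

For "cat chases dog" and "dog chases cat" the bag-of-words cosine is 1 because both texts contain the same words, while the phrase overlap is only 1/3 because no multi-word phrase is shared.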


Affinity Propagation Based Document Clustering Using Suffix Tree

The proposed affinity propagation clustering approach is very effective at clustering the documents of the standard OHSUMED dataset in comparison with existing document clustering methods.

Clustering textual documents by extracting sequence from word-of-graph

A sequence-based representation of a document, extracted from the document's graph-of-word, that outperforms the traditional approaches on clustering measures such as purity, entropy and F-score.
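
Two of the measures named above, purity and entropy, can be computed as in the following sketch. It uses the common textbook definitions (size-weighted averages over clusters, entropy in bits without normalization); the cited paper may use a different variant, and the function name is an assumption. The input lists are assumed to be non-empty and of equal length.

```python
from collections import Counter, defaultdict
from math import log2


def purity_and_entropy(cluster_ids, class_labels):
    """Return (purity, entropy) for a clustering against gold class labels."""
    # Count how many documents of each gold class fall into each cluster.
    clusters = defaultdict(Counter)
    for c, y in zip(cluster_ids, class_labels):
        clusters[c][y] += 1
    n = len(cluster_ids)

    # Purity: fraction of documents that belong to their cluster's majority class.
    purity = sum(max(counts.values()) for counts in clusters.values()) / n

    # Entropy: size-weighted average of each cluster's class entropy.
    entropy = 0.0
    for counts in clusters.values():
        size = sum(counts.values())
        h = -sum((v / size) * log2(v / size) for v in counts.values())
        entropy += (size / n) * h
    return purity, entropy
```

Higher purity and lower entropy both indicate clusters that align more closely with the gold classes.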

Document Clustering based on Topic Maps

A new approach for document clustering based on the Topic Map representation of the documents is introduced and a similarity measure is proposed based upon the inferred information through topic maps data and structures.

Document Clustering Approaches using Affinity Propagation

This work will study the key challenges of the clustering problem, as it applies to the text domain, and discuss the key methods used for text clustering, and their relative advantages.

Auto-assemblage for Suffix Tree Clustering

The paper presents a tool that describes the algorithmic steps of the Suffix Tree Clustering (STC) algorithm for clustering documents, along with a short introduction to partitioned and hierarchical document clustering techniques.

Study of Different Document Representation Models for Finding Phrase-Based Similarity

This paper analyzes and compares different representation models on different parameters to find phrase-based similarity and shows how different document representation models can store words, phrases, or converted numerical data to find phrases.

Improving Suffix Tree Clustering Algorithm for Web Documents

An improved suffix tree clustering method that combines the vector space model with the Pearson correlation coefficient: it calculates the relevance of clusters based on the document vectors of all clusters, then uses these relevance vectors and the correlations between them to calculate the similarity for cluster merging, improving the clustering of documents.
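
As a rough illustration of the correlation step mentioned above, the sketch below computes the Pearson correlation coefficient between two cluster vectors. How the cited paper builds those vectors and what threshold it uses for merging are not stated here, so the example vectors are hypothetical and the function only shows the coefficient itself.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric vectors."""
    if len(x) != len(y) or not x:
        raise ValueError("vectors must be non-empty and of equal length")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    # A constant vector has no variance; treat the correlation as 0 in that case.
    if std_x == 0 or std_y == 0:
        return 0.0
    return cov / (std_x * std_y)


# Hypothetical term-weight vectors for two candidate clusters.
cluster_a = [0.9, 0.1, 0.4, 0.0]
cluster_b = [0.8, 0.2, 0.5, 0.1]
similarity = pearson(cluster_a, cluster_b)
```

A value near 1 would suggest the two clusters weight terms similarly and are candidates for merging; a value near 0 or below would suggest keeping them apart.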

A Semi-supervised approach to Document Clustering with Sequence Constraints

The proposed semi-supervised approach to document clustering is implemented and extensively tested on three standard text mining datasets, and clearly outperforms recently proposed document clustering algorithms in terms of standard evaluation measures.

An Analytical Assessment on Document Clustering

This paper articulates the key requirements for web document clustering, with clusters created on the full text of the web documents, and focuses on comparing different clustering algorithms.


The authors created applications that use these two algorithms and tested them on the same corpus of documents and presented improvements that provide faster search and better search results.



Efficient Phrase-Based Document Similarity for Clustering

Phrase-based document similarity is applied to the group-average Hierarchical Agglomerative Clustering (HAC) algorithm, yielding a new clustering approach that is very effective at clustering the documents of two standard benchmark corpora, OHSUMED and RCV1.

Efficient phrase-based document indexing for Web document clustering

A novel phrase-based document index model, the document index graph, is presented, which allows for incremental construction of a phrase-based index of the document set with an emphasis on efficiency, rather than relying on single-term indexes only.

Text document clustering based on frequent word meaning sequences

Fast and effective text mining using linear-time document clustering

An unsupervised, near-linear time text clustering system that offers a number of algorithm choices for each phase, and a refinement to center adjustment, “vector average damping,” that further improves cluster quality.

A Comparison of Document Clustering Techniques

This paper compares the two main approaches to document clustering, agglomerative hierarchical clustering and K-means, and indicates that the bisecting K-means technique is better than the standard K-means approach and as good as or better than the hierarchical approaches that were tested, for a variety of cluster evaluation metrics.
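
The bisecting idea can be sketched as follows: start with one cluster and repeatedly split one cluster in two with 2-means until k clusters exist. The choices below are assumptions for the example, not details taken from the cited paper: Euclidean distance on small numeric tuples, splitting the cluster with the largest sum of squared errors, and a deterministic farthest-point seeding.

```python
def _mean(points):
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def _sqdist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def _sse(points):
    center = _mean(points)
    return sum(_sqdist(p, center) for p in points)


def _two_means(points, iters=20):
    """Split points into two groups with 2-means and deterministic seeds."""
    centroid = _mean(points)
    c0 = max(points, key=lambda p: _sqdist(p, centroid))  # farthest from the mean
    c1 = max(points, key=lambda p: _sqdist(p, c0))        # farthest from c0
    left, right = [], []
    for _ in range(iters):
        left = [p for p in points if _sqdist(p, c0) <= _sqdist(p, c1)]
        right = [p for p in points if _sqdist(p, c0) > _sqdist(p, c1)]
        if not left or not right:
            break
        new0, new1 = _mean(left), _mean(right)
        if (new0, new1) == (c0, c1):  # centers stopped moving
            break
        c0, c1 = new0, new1
    return left, right


def bisecting_kmeans(points, k):
    """Return up to k clusters (lists of points) by repeated bisection."""
    clusters = [list(points)]
    while len(clusters) < k:
        # Only clusters with at least two distinct points can be split.
        splittable = [c for c in clusters if len(set(c)) > 1]
        if not splittable:
            break
        target = max(splittable, key=_sse)
        left, right = _two_means(target)
        if not left or not right:
            break
        clusters.remove(target)
        clusters += [left, right]
    return clusters
```

For example, `bisecting_kmeans([(0, 0), (1, 0), (10, 0), (11, 0), (30, 0), (31, 0)], 3)` first separates the two points near x = 30 from the rest, then splits the remaining four points into the pairs near x = 0 and x = 10.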

Scatter/Gather: a cluster-based approach to browsing large document collections

This work presents a document browsing technique that employs document clustering as its primary operation, and presents fast (linear time) clustering algorithms which support this interactive browsing paradigm.

The Role of Clustering in Search Computing

  • A. Campi, S. Ronchi
  • Computer Science
  • 2009 20th International Workshop on Database and Expert Systems Application
  • 2009
A novel language is proposed in order to explore the results retrieved by several internet search services and search engines that cluster retrieved documents to offer users a tool to discover relevant hidden relationships between clustered documents.

Data clustering: a review

An overview of pattern clustering methods from a statistical pattern recognition perspective is presented, with a goal of providing useful advice and references to fundamental concepts accessible to the broad community of clustering practitioners.

Similarity Search - The Metric Space Approach

Similarity Search focuses on the state of the art in developing index structures for searching the metric space, and provides an extensive survey of specific techniques for a large range of applications.

Reexamining the cluster hypothesis: scatter/gather on retrieval results

This work systematically evaluates Scatter/Gather in this context, finds significant improvements over similarity search ranking alone, and provides evidence validating the cluster hypothesis, which states that relevant documents tend to be more similar to each other than to non-relevant documents.