Semantic Website Clustering


We propose a new approach to clustering web pages. Using an iterative reinforced algorithm, the model extracts semantic feature vectors from user click-through data. We then apply LSA (Latent Semantic Analysis) to reduce the feature dimensionality and the K-means algorithm to cluster the documents. Compared with the traditional approach to feature extraction (the lexical binomial model), our model achieves better purity (75%) and F-measure (52%). Combining the features from both methods reaches a purity of 82% and an F-measure of 52%. The same method can also be used to cluster queries, yielding a purity of 74% and an F-measure of 43%.
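The purity and F-measure figures quoted above can be made concrete with a small sketch. The abstract does not say which F-measure variant is used, so the pairwise definition below is an assumption, and the function names `purity` and `pairwise_f_measure` are hypothetical helpers, not taken from the paper.

```python
from collections import Counter
from itertools import combinations


def purity(clusters, labels):
    """Fraction of items that carry the majority label of their cluster."""
    by_cluster = {}
    for cluster_id, label in zip(clusters, labels):
        by_cluster.setdefault(cluster_id, Counter())[label] += 1
    # Each cluster contributes the size of its largest label group.
    majority_total = sum(max(counts.values()) for counts in by_cluster.values())
    return majority_total / len(labels)


def pairwise_f_measure(clusters, labels):
    """Pairwise F1 (an assumed variant; the paper may define it differently).

    A pair of items is a true positive when both share a cluster and a label.
    """
    tp = fp = fn = 0
    for i, j in combinations(range(len(labels)), 2):
        same_cluster = clusters[i] == clusters[j]
        same_label = labels[i] == labels[j]
        if same_cluster and same_label:
            tp += 1
        elif same_cluster:
            fp += 1
        elif same_label:
            fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy data (invented for illustration): six pages, two clusters, two true topics.
toy_clusters = [0, 0, 0, 1, 1, 1]
toy_labels = ["a", "a", "b", "b", "b", "a"]
toy_purity = purity(toy_clusters, toy_labels)
toy_f = pairwise_f_measure(toy_clusters, toy_labels)
```

On this toy input, four of the six pages match their cluster's majority topic, so purity is 4/6. Purity rewards homogeneous clusters but says nothing about whether a topic is split across clusters, which is why it is usually reported together with an F-measure.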

Cite this paper

@inproceedings{Yang2007SemanticWC, title={Semantic Website Clustering}, author={I-Hsuan Yang and Yu-tsun Huang and Yen-Ling Huang}, year={2007} }