Focused Crawling by Learning HMM from User's Topic-specific Browsing


A focused crawler traverses the Web to gather documents on a specific topic. Predicting which links lead to relevant pages is difficult. In this paper, we present a new approach that predicts which links are likely to lead to relevant pages, based on a learned user model. In particular, we first collect the pages that a user visits during a learning session, in which the user browses the Web and explicitly marks the pages she is interested in. We then examine the semantic content of these pages to construct a concept graph, which is used to learn the dominant content and link structure leading to target pages using a Hidden Markov Model (HMM). Experiments show that a crawler guided by an HMM learned from a user's browsing performs better than the Best-First strategy.
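The idea of ranking links with an HMM learned from browsing paths can be illustrated with a minimal sketch. It is not the implementation evaluated in the paper; the abstract does not specify the model details, so the following are assumptions made only for illustration: hidden states stand for the number of hops remaining to a target page (state 0 is a target), observations are integer concept-cluster ids of visited pages, parameters are estimated by smoothed counting over labeled paths, and the function names `train_hmm`, `forward_belief`, and `link_priority` are hypothetical.

```python
def _normalize(row):
    """Scale a list of non-negative weights so that it sums to 1."""
    total = sum(row)
    return [v / total for v in row]


def train_hmm(sequences, n_states, n_obs, alpha=1.0):
    """Estimate HMM parameters from labeled browsing paths by counting.

    Each sequence is a list of (state, observation) pairs. ASSUMPTION:
    state = hops remaining to a target page, observation = concept-cluster
    id of the page. alpha is an add-alpha smoothing constant.
    """
    init = [alpha] * n_states
    trans = [[alpha] * n_states for _ in range(n_states)]
    emit = [[alpha] * n_obs for _ in range(n_states)]
    for seq in sequences:
        init[seq[0][0]] += 1
        for state, obs in seq:
            emit[state][obs] += 1
        for (s, _), (t, _) in zip(seq, seq[1:]):
            trans[s][t] += 1
    return (_normalize(init),
            [_normalize(r) for r in trans],
            [_normalize(r) for r in emit])


def forward_belief(observations, init, trans, emit):
    """Forward algorithm: P(current state | observations so far)."""
    n = len(init)
    belief = _normalize([init[s] * emit[s][observations[0]] for s in range(n)])
    for obs in observations[1:]:
        belief = _normalize([
            emit[t][obs] * sum(belief[s] * trans[s][t] for s in range(n))
            for t in range(n)
        ])
    return belief


def link_priority(observations, init, trans, emit):
    """Probability that the next page reached is a target (state 0)."""
    belief = forward_belief(observations, init, trans, emit)
    return sum(belief[s] * trans[s][0] for s in range(len(belief)))


# Toy training data (made up): five identical paths that move from
# two hops away (cluster 2) to one hop away (cluster 1) to a target (cluster 0).
paths = [[(2, 2), (1, 1), (0, 0)] for _ in range(5)]
init, trans, emit = train_hmm(paths, n_states=3, n_obs=3)

# Score two hypothetical crawl paths by the clusters of the pages seen so far.
near = link_priority([2, 1], init, trans, emit)
far = link_priority([2], init, trans, emit)
print(near, far)
```

In a crawler, a score like `link_priority` would order the frontier queue, so that links on paths whose observed clusters resemble those leading to marked pages are fetched first. A Best-First crawler, by contrast, typically ranks links only by the similarity of the current page to the topic.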

DOI: 10.1109/WI.2004.70

@article{Liu2004FocusedCB, title={Focused Crawling by Learning HMM from User's Topic-specific Browsing}, author={Hongyu Liu and Evangelos E. Milios and Jeannette C. M. Janssen}, journal={IEEE/WIC/ACM International Conference on Web Intelligence (WI'04)}, year={2004}, pages={732-732} }