TS-IDS Algorithm for Query Selection in the Deep Web Crawling

Abstract

Deep web crawling is the process of collecting the data items held in a data source hidden behind a searchable interface. Because the data can be accessed only by sending queries, one research challenge is to select a set of queries that retrieves most of the data with minimal network traffic. This is an instance of the set covering problem, which is NP-hard. The size of the problem, in both the number of documents and the number of terms involved, calls for new approximation algorithms for efficient deep web crawling. Inspired by the TF-IDF weighting measure in information retrieval, this paper proposes the TS-IDS algorithm, which assigns each document an importance value proportional to term size (TS) and inversely proportional to document size (IDS). The algorithm is tested extensively on a variety of datasets and compared with the traditional greedy algorithm and the more recent IDS algorithm. We demonstrate that TS-IDS outperforms the greedy algorithm and the IDS algorithm by up to 33% and 29%, respectively. Our work also contributes to the classic set covering problem by leveraging the long-tail distributions of terms and documents in natural languages. Since long-tail distributions are ubiquitous in the real world, our approach can be applied in areas other than deep web crawling.
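
The set covering formulation in the abstract can be illustrated with a minimal sketch of weighted greedy query selection. It is not the authors' implementation: the abstract does not give the exact TS-IDS weighting formula, so the sketch takes document weights as an input and shows only two simple choices, uniform weights (plain greedy) and weights inversely proportional to document size (an IDS-style assumption). The function names, the toy corpus, and the cost model (a query costs as many documents as it returns) are all assumptions made for illustration.

```python
from collections import defaultdict


def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            index[term].add(doc_id)
    return dict(index)


def uniform_weights(docs):
    """Every document counts equally (plain greedy set cover)."""
    return {doc_id: 1.0 for doc_id in docs}


def inverse_size_weights(docs):
    """Assumed IDS-style weight: 1 / number of distinct terms in the document."""
    return {doc_id: 1.0 / len(terms) for doc_id, terms in docs.items()}


def select_queries(index, weights, target=1.0):
    """Greedily pick terms until `target` fraction of documents is covered.

    Each step picks the term with the highest ratio of newly covered
    document weight to query cost. Cost is assumed to be the number of
    documents the query returns, as a stand-in for network traffic.
    """
    goal = target * len(weights)
    covered, chosen, cost = set(), [], 0
    while len(covered) < goal:
        best, best_score = None, 0.0
        for term in sorted(index):  # sorted for deterministic tie-breaking
            new_docs = index[term] - covered
            if term in chosen or not new_docs:
                continue
            score = sum(weights[d] for d in new_docs) / len(index[term])
            if score > best_score:
                best, best_score = term, score
        if best is None:  # no remaining term covers anything new
            break
        chosen.append(best)
        covered |= index[best]
        cost += len(index[best])
    return chosen, covered, cost


# Hypothetical toy corpus: document id -> set of terms.
docs = {1: {"a", "b"}, 2: {"a"}, 3: {"b", "c"}, 4: {"c"}}
index = build_index(docs)
chosen, covered, cost = select_queries(index, uniform_weights(docs))
```

Swapping `uniform_weights` for `inverse_size_weights` changes only the scoring of candidate terms, which is where a TS-IDS-style weight would plug in once its formula is taken from the paper itself.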

DOI: 10.1007/978-3-319-11116-2_17

Cite this paper

@inproceedings{Wang2014TSIDSAF,
  title={TS-IDS Algorithm for Query Selection in the Deep Web Crawling},
  author={Yan Wang and Jianguo Lu and Jessica Chen},
  booktitle={APWeb},
  year={2014}
}