High performance crawling system

Abstract

In this paper, we describe the design and implementation of a real-time distributed Web crawling system running on a cluster of machines. The system crawls several thousand pages per second, includes a high-performance fault manager, is platform independent, and adapts transparently to a wide range of configurations without incurring additional hardware expenditure. We then detail the system architecture and describe the technical choices that enable very high performance crawling. Finally, we discuss the experimental results and compare them with other documented systems.

DOI: 10.1145/1026711.1026760

4 Figures and Tables

Statistics

[Figure: Citations per Year, 2005 to 2017]

Semantic Scholar estimates that this publication has 52 citations based on the available data.

Cite this paper

@inproceedings{Hafri2004HighPC,
  title={High performance crawling system},
  author={Younes Hafri and Chabane Djeraba},
  booktitle={Multimedia Information Retrieval},
  year={2004}
}