IRLbot: scaling to 6 billion pages and beyond

Abstract

This article shares our experience in designing a Web crawler that can download billions of pages using a single-server implementation and models its performance. We first show that current crawling algorithms cannot effectively cope with the sheer volume of URLs generated in large crawls, highly branching spam, legitimate multimillion-page blog sites, and…
DOI: 10.1145/1541822.1541823

Statistics

Semantic Scholar estimates that this publication has 122 citations based on the available data.

See our FAQ for additional information.

Cite this paper

@article{Lee2008IRLbotST,
  title={IRLbot: scaling to 6 billion pages and beyond},
  author={Hsin-Tsang Lee and Derek Leonard and Xiaoming Wang and Dmitri Loguinov},
  journal={TWEB},
  year={2008},
  volume={3},
  pages={8:1-8:34}
}