UbiCrawler: a scalable fully distributed Web crawler

@article{Boldi2004UbiCrawlerAS,
  title={UbiCrawler: a scalable fully distributed Web crawler},
  author={Paolo Boldi and Bruno Codenotti and Massimo Santini and Sebastiano Vigna},
  journal={Software: Practice and Experience},
  year={2004},
  volume={34}
}
We report our experience in implementing UbiCrawler, a scalable distributed Web crawler, using the Java programming language. The main features of UbiCrawler are platform independence, linear scalability, graceful degradation in the presence of faults, a very effective assignment function (based on consistent hashing) for partitioning the domain to crawl, and, more generally, the complete decentralization of every task. The necessity of handling very large sets of data has highlighted some…
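The abstract names consistent hashing as the basis of UbiCrawler's assignment function but shows no code. A minimal Java sketch of that general technique follows; it is not UbiCrawler's actual implementation, and the class and method names, the MD5-based hash, and the replica scheme are assumptions made here for illustration.

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.SortedMap;
import java.util.TreeMap;

// Sketch of a consistent-hashing assignment function: each crawler agent
// owns several points ("replicas") on a hash ring, and a host is assigned
// to the agent whose point follows the host's hash clockwise. Adding or
// removing an agent only reassigns the hosts on the affected arcs.
public class ConsistentAssignment {
    private final SortedMap<Long, String> ring = new TreeMap<>();
    private final int replicas;

    public ConsistentAssignment(int replicas) {
        this.replicas = replicas;
    }

    public void addAgent(String agentId) {
        for (int i = 0; i < replicas; i++) {
            ring.put(hash(agentId + "#" + i), agentId);
        }
    }

    public void removeAgent(String agentId) {
        for (int i = 0; i < replicas; i++) {
            ring.remove(hash(agentId + "#" + i));
        }
    }

    // Maps a host (e.g. "example.com") to the agent responsible for it.
    public String assign(String host) {
        if (ring.isEmpty()) throw new IllegalStateException("no agents");
        SortedMap<Long, String> tail = ring.tailMap(hash(host));
        Long key = tail.isEmpty() ? ring.firstKey() : tail.firstKey();
        return ring.get(key);
    }

    // Packs the first 8 bytes of an MD5 digest into a long ring position.
    private static long hash(String s) {
        try {
            byte[] d = MessageDigest.getInstance("MD5")
                    .digest(s.getBytes(StandardCharsets.UTF_8));
            long h = 0;
            for (int i = 0; i < 8; i++) {
                h = (h << 8) | (d[i] & 0xffL);
            }
            return h;
        } catch (NoSuchAlgorithmException e) {
            throw new AssertionError(e);
        }
    }
}

Because every agent that knows the same agent list computes the same assignment, assign("example.com") returns the same answer everywhere without a central coordinator, which is consistent with the complete decentralization the abstract describes.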

A Scalable Lightweight Distributed Crawler for Crawling with Limited Resources

This paper proposes a distributed crawling concept which is designed to avoid approximations, to limit the network overhead, and to run on relatively inexpensive hardware.

Distributed High-performance Web Crawlers : A Survey of the State of the Art

Web Crawlers (also called Web Spiders or Robots) are programs used to download documents from the internet. Simple crawlers can be used by individuals to copy an entire web site to their hard drive.

Scrawler: A Seed-By-Seed Parallel Web Crawler

This paper presents the design and implementation of an effective parallel web crawler, and investigates the URL distributor for URL balancing and the scalability of the crawler.

A Scalable P2P RIA Crawling System with Fault Tolerance

This research addresses the scalability and resilience problems when crawling RIAs in a distributed environment and explores the possibilities of designing an efficient RIA crawling system that is scalable and fault-tolerant.

Building a Peer-to-Peer, domain specific web crawler

The building blocks of PeerCrawl, a Peer-to-Peer web crawler, are described; PeerCrawl can be used for generic crawling, is easily scalable, and can be implemented on a grid of day-to-day use computers.

Around the web in six weeks: Documenting a large-scale crawl

An extensive measurement study of the collected dataset is undertaken, a framework for modeling the scaling rate of various data structures as crawl size goes to infinity is proposed, and a methodology for comparing crawl coverage to that of commercial search engines is offered.

UniCrawl: A Practical Geographically Distributed Web Crawler

This paper presents a geo-distributed crawler solution, UniCrawl, that orchestrates several geographically distributed sites, splits the crawled domain space across the sites, and federates their storage and computing resources, while minimizing the inter-site communication cost.

An Efficient Multi-Threaded Web Crawler Using HashMaps

In this paper, an efficient multi-threaded web crawler is proposed and empirically analyzed in terms of crawling speed and coverage.
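The paper's implementation is not reproduced on this page. As an illustration of the general pattern a multi-threaded, hash-map-based crawler might follow, here is a minimal Java sketch; all names are hypothetical, fetching and parsing are left abstract, and termination handling is omitted for brevity.

import java.util.List;
import java.util.Set;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;

// Sketch of a multi-threaded crawler loop: worker threads share a frontier
// queue and a hash-map-backed "seen" set so each URL is fetched at most once.
public class MultiThreadedCrawler {
    private final BlockingQueue<String> frontier = new LinkedBlockingQueue<>();
    private final Set<String> seen = ConcurrentHashMap.newKeySet();
    private final ExecutorService pool;
    private final int threads;

    public MultiThreadedCrawler(int threads) {
        this.threads = threads;
        this.pool = Executors.newFixedThreadPool(threads);
    }

    public void crawl(String seed) {
        submit(seed);
        for (int i = 0; i < threads; i++) {
            pool.submit(this::workerLoop);
        }
    }

    private void submit(String url) {
        // add() returns false if the URL was already seen, giving a
        // thread-safe de-duplication check in a single step.
        if (seen.add(url)) {
            frontier.offer(url);
        }
    }

    private void workerLoop() {
        try {
            while (true) {
                String url = frontier.take();
                for (String link : fetchAndExtractLinks(url)) {
                    submit(link);
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Placeholder: download the page and return the out-links it contains.
    private List<String> fetchAndExtractLinks(String url) {
        return List.of();
    }
}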

Design and Implementation of an Efficient Distributed Web Crawler with Scalable Architecture

A distributed crawler system that consists of multiple controllers and combines the advantages of both architectures; it involves a fully distributed architecture, a strategy for assigning tasks, and a method to ensure system scalability.

Tarantula - A Scalable and Extensible Web Spider

The structure of the crawler facilitates new navigation techniques that can be used alongside existing techniques to give improved crawl results; a comparison with the Heritrix (Mohr et al.) crawler is presented.
...

References


Parallel crawlers

This paper proposes multiple architectures for a parallel crawler, identifies fundamental issues related to parallel crawling, proposes metrics to evaluate a parallel crawler, and compares the proposed architectures using 40 million pages collected from the Web.

Design and implementation of a high-performance distributed Web crawler

This paper describes the design and implementation of a distributed Web crawler that runs on a network of workstations that scales to several hundred pages per second, is resilient against system crashes and other events, and can be adapted to various crawling applications.

Design and Implementation of a Distributed Crawler and Filtering Processor

This paper presents the architecture, implementation, and experimental evaluation of WebRACE, a high-performance, distributed Web crawler, filtering server, and object cache, designed in the context of eRace, an extensible Retrieval Annotation Caching Engine that collects, annotates, and disseminates information from heterogeneous Internet sources and protocols according to XML-encoded user profiles.

High-performance web crawling

This chapter describes the experience of building and operating a high-performance crawler, an important component of many web services, and the techniques used to efficiently access and update data structures far too large to fit in main memory.

Architectural design and evaluation of an efficient web-crawling system

Web Caching with Consistent Hashing

Consistent hashing and random trees: distributed caching protocols for relieving hot spots on the World Wide Web

A family of caching protocols for distributed networks that can be used to decrease or eliminate the occurrence of hot spots in the network, based on a special kind of hashing called consistent hashing.

Performance limitations of the Java core libraries

This paper describes the most serious pitfalls using Java to build a high-performance web crawler, and how workarounds more than doubled the speed of the program.

Breadth-first crawling yields high-quality pages

This paper examines the average quality over time of pages downloaded during a web crawl of 328 million unique pages, using the connectivity-based metric PageRank to measure the quality of a page.
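To make the crawl ordering concrete, here is a minimal single-threaded Java sketch (hypothetical names, fetching left abstract) in which a FIFO frontier visits pages in breadth-first order of link distance from the seed, the ordering whose quality the paper studies.

import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.List;
import java.util.Queue;
import java.util.Set;

// Breadth-first crawling: a FIFO frontier means pages closer to the seed
// (in link distance) are downloaded before pages farther away.
public class BreadthFirstCrawl {
    public static void crawl(String seed, int maxPages) {
        Queue<String> frontier = new ArrayDeque<>();
        Set<String> seen = new HashSet<>();
        frontier.add(seed);
        seen.add(seed);
        int fetched = 0;
        while (!frontier.isEmpty() && fetched < maxPages) {
            String url = frontier.poll();   // FIFO => breadth-first order
            fetched++;
            for (String link : fetchAndExtractLinks(url)) {
                if (seen.add(link)) {
                    frontier.add(link);
                }
            }
        }
    }

    // Placeholder for page download and link extraction.
    private static List<String> fetchAndExtractLinks(String url) {
        return List.of();
    }
}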

The Anatomy of a Large-Scale Hypertextual Web Search Engine