UbiCrawler: a scalable fully distributed Web crawler

@article{Boldi2004UbiCrawlerAS,
  title={UbiCrawler: a scalable fully distributed Web crawler},
  author={Paolo Boldi and Bruno Codenotti and Massimo Santini and Sebastiano Vigna},
  journal={Software: Practice and Experience},
  year={2004},
  volume={34}
}
We report our experience in implementing UbiCrawler, a scalable distributed Web crawler, using the Java programming language. The main features of UbiCrawler are platform independence, linear scalability, graceful degradation in the presence of faults, a very effective assignment function (based on consistent hashing) for partitioning the domain to crawl, and, more generally, the complete decentralization of every task. The necessity of handling very large sets of data has highlighted some…
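To illustrate the consistent-hashing assignment mentioned in the abstract, the following is a minimal sketch in Java, not UbiCrawler's actual code; the class name, the MD5-based hash, and the number of virtual replicas are illustrative assumptions. Each crawling agent is mapped to several points on a hash ring, and a host is assigned to the agent owning the first ring point at or after the host's hash, so that adding or removing an agent only moves the hosts adjacent to that agent's points.

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.SortedMap;
import java.util.TreeMap;

// Illustrative consistent-hashing assignment of hosts to crawling agents
// (a sketch under assumed names, not the UbiCrawler implementation).
public class ConsistentAssignment {
    private static final int REPLICAS = 100;              // virtual points per agent (assumed value)
    private final SortedMap<Long, String> ring = new TreeMap<>();

    public ConsistentAssignment(Iterable<String> agents) {
        for (String agent : agents) {
            for (int i = 0; i < REPLICAS; i++) {
                ring.put(hash(agent + "#" + i), agent);    // place virtual points on the ring
            }
        }
    }

    // Returns the agent responsible for the given host.
    public String assign(String host) {
        long h = hash(host);
        SortedMap<Long, String> tail = ring.tailMap(h);    // ring points at or after h
        return ring.get(tail.isEmpty() ? ring.firstKey() : tail.firstKey());
    }

    private static long hash(String s) {
        try {
            byte[] d = MessageDigest.getInstance("MD5")
                    .digest(s.getBytes(StandardCharsets.UTF_8));
            long h = 0;
            for (int i = 0; i < 8; i++) {                  // fold the first 8 digest bytes into a long
                h = (h << 8) | (d[i] & 0xFF);
            }
            return h;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}

Under such a scheme, the failure of one agent only redistributes that agent's hosts among the surviving agents, which is what makes graceful degradation possible without any central coordinator.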
A Scalable Lightweight Distributed Crawler for Crawling with Limited Resources
  • Milly Kc, M. Hagenbuchner, A. Tsoi
  • Computer Science, Materials Science
  • 2008 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology
  • 2008
This paper proposes a distributed crawling concept which is designed to avoid approximations, to limit the network overhead, and to run on relatively inexpensive hardware.
Distributed High-performance Web Crawlers: A Survey of the State of the Art
Web Crawlers (also called Web Spiders or Robots) are programs used to download documents from the internet. Simple crawlers can be used by individuals to copy an entire web site to their hard drive…
Scrawler: A Seed-By-Seed Parallel Web Crawler
This paper presents the design and implementation of an effective parallel web crawler, and investigates the URL distributor for URL balancing and the scalability of the crawler.
A Scalable P2P RIA Crawling System with Fault Tolerance
Rich Internet Applications (RIAs) have been widely used in the web over the last decade as they were found to be responsive and user-friendly compared to traditional web applications. RIAs use…
Building a Peer-to-Peer, domain-specific web crawler
The introduction of a crawler in the mid 90s opened the floodgates for research in various application domains. Many attempts to create an ideal crawler failed due to the explosive nature of the web. In…
Around the web in six weeks: Documenting a large-scale crawl
An extensive measurement study of the collected dataset is undertaken, a framework for modeling the scaling rate of various data structures as crawl size goes to infinity is proposed, and a methodology for comparing crawl coverage to that of commercial search engines is offered.
UniCrawl: A Practical Geographically Distributed Web Crawler
This paper presents a geo-distributed crawler solution, UniCrawl, that orchestrates several geographically distributed sites, splits the crawled domain space across the sites, and federates their storage and computing resources, while minimizing the inter-site communication cost.
BUbiNG: Massive Crawling for the Masses
BUbiNG, the next-generation web crawler built upon the authors’ experience with UbiCrawler and on the last ten years of research on the topic, is described.
An Efficient Multi-Threaded Web Crawler Using HashMaps
In this paper, an efficient multi-threaded web crawler is proposed, and empirically analyzed in terms of crawling speed and coverage.
Design and Implementation of an Efficient Distributed Web Crawler with Scalable Architecture
Distributed Web crawlers have recently received more and more attention from researchers. Centralized solutions are known to have problems like link congestion and being a single point of failure, while…

References

Showing 1-10 of 41 references
Parallel crawlers
This paper proposes multiple architectures for a parallel crawler, identifies fundamental issues related to parallel crawling, proposes metrics to evaluate a parallel crawler, and compares the proposed architectures using 40 million pages collected from the Web.
Design and implementation of a high-performance distributed Web crawler
This paper describes the design and implementation of a distributed Web crawler that runs on a network of workstations, scales to several hundred pages per second, is resilient against system crashes and other events, and can be adapted to various crawling applications.
Design and Implementation of a Distributed Crawler and Filtering Processor
This paper presents the architecture and implementation of, and experimentation with, WebRACE, a high-performance, distributed Web crawler, filtering server and object cache, designed in the context of eRace, an extensible Retrieval Annotation Caching Engine, which collects, annotates and disseminates information from heterogeneous Internet sources and protocols according to XML-encoded user profiles.
High-performance web crawling
This chapter describes the experience of building and operating a high-performance crawler, an important component of many web services, whose data structures are far too large to fit in main memory yet must be accessed and updated efficiently.
Architectural design and evaluation of an efficient Web-crawling system
This paper presents the architectural design and evaluation results of an efficient Web-crawling system that has been successfully integrated with WebGather, a well-known Chinese and English Web search engine aimed at collecting all the Web pages in China and keeping pace with the rapid growth of Chinese Web information.
Web Caching with Consistent Hashing
This paper describes the implementation of a consistent-hashing-based system and experiments that support the thesis that it can provide performance improvements; the approach offers an alternative to multicast and directory schemes and has several other advantages in load balancing and fault tolerance.
Consistent hashing and random trees: distributed caching protocols for relieving hot spots on the World Wide Web
A family of caching protocols for distributed networks that can be used to decrease or eliminate the occurrence of hot spots in the network, based on a special kind of hashing that is called consistent hashing.
Performance limitations of the Java core libraries
This paper describes the most serious pitfalls of using Java to build a scalable web crawler, and how workarounds more than doubled the speed of the crawler.
Breadth-first crawling yields high-quality pages
This paper examines the average page quality over time of pages downloaded during a web crawl of 328 million unique pages. We use the connectivity-based metric PageRank to measure the quality of a…
The Anatomy of a Large-Scale Hypertextual Web Search Engine
This paper provides an in-depth description of Google, a prototype of a large-scale search engine which makes heavy use of the structure present in hypertext, and looks at the problem of how to effectively deal with uncontrolled hypertext collections where anyone can publish anything they want.