UbiCrawler: a scalable fully distributed Web crawler
@article{Boldi2004UbiCrawlerAS,
  title   = {UbiCrawler: a scalable fully distributed Web crawler},
  author  = {Paolo Boldi and Bruno Codenotti and Massimo Santini and Sebastiano Vigna},
  journal = {Software: Practice and Experience},
  year    = {2004},
  volume  = {34}
}
We report our experience in implementing UbiCrawler, a scalable distributed Web crawler, using the Java programming language. The main features of UbiCrawler are platform independence, linear scalability, graceful degradation in the presence of faults, a very effective assignment function (based on consistent hashing) for partitioning the domain to crawl, and, more generally, the complete decentralization of every task. The necessity of handling very large sets of data has highlighted some…
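The assignment function the abstract alludes to is based on consistent hashing: each crawler agent owns a set of points on a hash ring, and each host is crawled by the agent owning the first point clockwise from the host's hash. The paper's own implementation is not reproduced here; the following is a minimal Java sketch of such a ring, where the replica count, agent names, and choice of SHA-1 are illustrative assumptions.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.SortedMap;
import java.util.TreeMap;

/** A minimal consistent-hashing ring mapping host names to crawler agents.
    Names and the replica count are illustrative, not UbiCrawler's own. */
public class ConsistentHashRing {
    private final SortedMap<Long, String> ring = new TreeMap<>();
    private final int replicas;

    public ConsistentHashRing(int replicas) { this.replicas = replicas; }

    // Hash a string to a point on the ring using the first 8 bytes of SHA-1.
    private static long hash(String key) {
        try {
            byte[] d = MessageDigest.getInstance("SHA-1")
                    .digest(key.getBytes(StandardCharsets.UTF_8));
            long h = 0;
            for (int i = 0; i < 8; i++) h = (h << 8) | (d[i] & 0xFF);
            return h;
        } catch (NoSuchAlgorithmException e) {
            throw new AssertionError(e); // SHA-1 is always available
        }
    }

    // Each agent occupies several points ("replicas") to even out the load.
    public void addAgent(String agent) {
        for (int i = 0; i < replicas; i++) ring.put(hash(agent + "#" + i), agent);
    }

    public void removeAgent(String agent) {
        for (int i = 0; i < replicas; i++) ring.remove(hash(agent + "#" + i));
    }

    // A host is assigned to the first agent clockwise from its hash point.
    public String agentFor(String host) {
        SortedMap<Long, String> tail = ring.tailMap(hash(host));
        return tail.isEmpty() ? ring.get(ring.firstKey()) : tail.get(tail.firstKey());
    }

    public static void main(String[] args) {
        ConsistentHashRing ring = new ConsistentHashRing(100);
        ring.addAgent("agent-0");
        ring.addAgent("agent-1");
        ring.addAgent("agent-2");
        System.out.println(ring.agentFor("example.org"));
        // Removing one agent reassigns only the hosts it owned;
        // all other host-to-agent assignments are unchanged.
        ring.removeAgent("agent-1");
        System.out.println(ring.agentFor("example.org"));
    }
}
```

Because only the removed agent's points vanish from the ring, a failure reassigns only the hosts that agent owned; this minimal-remapping property of consistent hashing (Karger et al., STOC '97, in the references below) is what underlies the graceful-degradation claim.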
618 Citations
A Scalable Lightweight Distributed Crawler for Crawling with Limited Resources
- Computer Science
- 2008 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology
- 2008
This paper proposes a distributed crawling concept which is designed to avoid approximations, to limit the network overhead, and to run on relatively inexpensive hardware.
Distributed High-performance Web Crawlers : A Survey of the State of the Art
- Computer Science
- 2003
Web Crawlers (also called Web Spiders or Robots) are programs used to download documents from the Internet. Simple crawlers can be used by individuals to copy an entire web site to their hard drive…
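To make the survey's opening definition concrete, here is a minimal sequential Java sketch of the kind of "simple crawler" it mentions, copying pages reachable from one seed within a single site. The seed URL, regex-based link extraction, and page cap are illustrative assumptions; a real crawler would also honor robots.txt and politeness delays.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** A deliberately simple breadth-first, single-site crawler sketch. */
public class SimpleCrawler {
    private static final Pattern LINK =
            Pattern.compile("href=[\"'](http[^\"']+)[\"']", Pattern.CASE_INSENSITIVE);

    public static void main(String[] args) throws Exception {
        String seed = "https://example.org/";          // illustrative seed
        String site = URI.create(seed).getHost();
        HttpClient client = HttpClient.newHttpClient();

        Deque<String> frontier = new ArrayDeque<>();   // BFS queue of URLs to fetch
        Set<String> seen = new HashSet<>();            // URLs already enqueued
        frontier.add(seed);
        seen.add(seed);

        int fetched = 0;
        while (!frontier.isEmpty() && fetched < 50) {  // small cap for the sketch
            String url = frontier.poll();
            HttpResponse<String> resp = client.send(
                    HttpRequest.newBuilder(URI.create(url)).GET().build(),
                    HttpResponse.BodyHandlers.ofString());
            fetched++;
            System.out.println(resp.statusCode() + " " + url);

            // Extract absolute links and keep only those on the same site.
            Matcher m = LINK.matcher(resp.body());
            while (m.find()) {
                String link = m.group(1);
                try {
                    if (site.equals(URI.create(link).getHost()) && seen.add(link)) {
                        frontier.add(link);
                    }
                } catch (IllegalArgumentException ignored) {
                    // skip malformed URLs found in the page
                }
            }
        }
    }
}
```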
Scrawler: A Seed-By-Seed Parallel Web Crawler
- Computer Science
- ICE-B
- 2007
This paper presents the design and implementation of an effective parallel web crawler, and investigates the URL distributor for URL balancing and the scalability of the crawler.
A Scalable P2P RIA Crawling System with Fault Tolerance
- Computer Science
- 2016
This research addresses the scalability and resilience problems when crawling RIAs in a distributed environment and explores the possibilities of designing an efficient RIA crawling system that is scalable and fault-tolerant.
Building a Peer-to-Peer, domain specific web crawler
- Computer Science
- 2006
The building blocks of PeerCrawl, a Peer-to-Peer web crawler, are described; it can be used for generic crawling, is easily scalable, and can be implemented on a grid of day-to-day computers.
UniCrawl: A Practical Geographically Distributed Web Crawler
- Computer Science
- 2015 IEEE 8th International Conference on Cloud Computing
- 2015
This paper presents a geo-distributed crawler solution, UniCrawl, that orchestrates several geographically distributed sites, splits the crawled domain space across the sites, and federates their storage and computing resources, while minimizing the inter-site communication cost.
Design and Implementation of an Efficient Distributed Web Crawler with Scalable Architecture
- Computer Science
- 2010
A distributed crawler system consisting of multiple controllers that combines the advantages of both architectures, involving a fully distributed architecture, a task-assignment strategy, and a method to ensure system scalability.
Tarantula - A Scalable and Extensible Web Spider
- Computer Science
- KMIS
- 2009
The structure of the crawler facilitates new navigation techniques that can be combined with existing ones to give improved crawl results, and a comparison with the Heritrix crawler (Mohr et al.) is presented.
A Full Distributed Web Crawler Based on Structured Network
- Computer Science
- AIRS
- 2008
A novel fully distributed Web crawler system based on a structured network is presented, and a distributed crawling model is developed and applied to it, improving the performance of the system.
The Viuva Negra crawler
- Computer Science
- 2006
The design, implementation, and evaluation of the Viuva Negra (VN) crawler, which feeds a search engine and an archive of the Portuguese web, are detailed, along with hazardous situations for crawling found on the web and the solutions adopted to mitigate their effects.
References
Showing 1–10 of 29 references
Parallel crawlers
- Computer Science
- WWW '02
- 2002
This paper proposes multiple architectures for a parallel crawler, identifies fundamental issues related to parallel crawling, proposes metrics to evaluate parallel crawlers, and compares the proposed architectures using 40 million pages collected from the Web.
Design and implementation of a high-performance distributed Web crawler
- Computer Science
- Proceedings 18th International Conference on Data Engineering
- 2002
This paper describes the design and implementation of a distributed Web crawler that runs on a network of workstations that scales to several hundred pages per second, is resilient against system crashes and other events, and can be adapted to various crawling applications.
Design and Implementation of a Distributed Crawler and Filtering Processor
- Computer Science
- NGITS
- 2002
This paper presents the architecture, implementation, and experimental evaluation of WebRACE, a high-performance distributed Web crawler, filtering server, and object cache. WebRACE is designed in the context of eRace, an extensible Retrieval Annotation Caching Engine that collects, annotates, and disseminates information from heterogeneous Internet sources and protocols according to XML-encoded user profiles.
High-performance web crawling
- Computer Science
- 2002
This chapter describes the experience of building and operating a high-performance crawler, an important component of many web services. Its data structures are far too large to fit in main memory, and the chapter describes techniques to access and update them efficiently.
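One standard way to keep the "URL already seen?" test compact when the URL set would not fit in RAM is a Bloom filter. The sketch below is a generic Java illustration of that idea, not the specific data structure this chapter describes; the bit-array size and hash count are illustrative.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

/** A tiny Bloom filter for a compact "have we seen this URL?" test. */
public class UrlBloomFilter {
    private final BitSet bits;
    private final int size;     // number of bits
    private final int hashes;   // number of hash functions

    public UrlBloomFilter(int size, int hashes) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashes = hashes;
    }

    // Derive k indices from two base hashes (double-hashing trick).
    private int index(String url, int i) {
        byte[] b = url.getBytes(StandardCharsets.UTF_8);
        int h1 = java.util.Arrays.hashCode(b);
        int h2 = 0;
        for (byte x : b) h2 = (31 * h2 + x) ^ 0x9E3779B9;
        return Math.floorMod(h1 + i * h2, size);
    }

    public void add(String url) {
        for (int i = 0; i < hashes; i++) bits.set(index(url, i));
    }

    /** False means definitely unseen; true means probably seen. */
    public boolean mightContain(String url) {
        for (int i = 0; i < hashes; i++)
            if (!bits.get(index(url, i))) return false;
        return true;
    }

    public static void main(String[] args) {
        UrlBloomFilter seen = new UrlBloomFilter(1 << 24, 5); // 16M bits = 2 MB
        seen.add("http://example.org/a");
        System.out.println(seen.mightContain("http://example.org/a")); // true
        System.out.println(seen.mightContain("http://example.org/b")); // false (almost surely)
    }
}
```

The trade-off is a small false-positive rate: a URL may occasionally be skipped as "seen" when it was not, which is usually acceptable for crawling at this scale.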
Architectural design and evaluation of an efficient web-crawling system
- Computer Science
- Proceedings 15th International Parallel and Distributed Processing Symposium. IPDPS 2001
- 2001
Consistent hashing and random trees: distributed caching protocols for relieving hot spots on the World Wide Web
- Computer Science
- STOC '97
- 1997
A family of caching protocols for distributed networks that can be used to decrease or eliminate the occurrence of hot spots in the network, based on a special kind of hashing called consistent hashing.
Performance limitations of the Java core libraries
- Computer Science
- JAVA '99
- 1999
This paper describes the most serious pitfalls encountered when using Java to build a high-performance web crawler, and how workarounds more than doubled the speed of the program.
Breadth-first crawling yields high-quality pages
- Computer Science, Engineering
- WWW '01
- 2001
This paper examines the average page quality over time of pages downloaded during a web crawl of 328 million unique pages. We use the connectivity-based metric PageRank to measure the quality of a…
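For context on the metric this snippet mentions, PageRank can be computed by power iteration over the link graph. The toy Java example below shows the computation; the 4-node graph and damping factor 0.85 are illustrative assumptions, not data from the paper.

```java
import java.util.Arrays;

/** Minimal PageRank power iteration on a toy adjacency list. */
public class PageRank {
    public static void main(String[] args) {
        int[][] out = { {1, 2}, {2}, {0}, {0, 2} }; // out-links of each node
        int n = out.length;
        double d = 0.85;                            // damping factor
        double[] rank = new double[n];
        Arrays.fill(rank, 1.0 / n);                 // start from the uniform vector

        for (int iter = 0; iter < 50; iter++) {
            double[] next = new double[n];
            Arrays.fill(next, (1 - d) / n);         // teleportation term
            for (int u = 0; u < n; u++)
                for (int v : out[u])
                    next[v] += d * rank[u] / out[u].length; // share rank over out-links
            rank = next;
        }
        for (int u = 0; u < n; u++)
            System.out.printf("node %d: %.4f%n", u, rank[u]);
    }
}
```

A page scores highly when it is linked from other high-scoring pages, which is the sense in which the paper finds that breadth-first crawling discovers high-quality pages early.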
The Anatomy of a Large-Scale Hypertextual Web Search Engine
- Computer Science
- Comput. Networks
- 1998