An adaptive model for optimizing performance of an incremental web crawler

  Authors: Jenny Edwards, Kevin S. McCurley, John A. Tomlin
  Venue: The Web Conference
This paper outlines the design of a web crawler implemented for IBM Almaden's WebFountain project and describes an optimization model for controlling the crawl strategy. This crawler is scalable and incremental. The model makes no assumptions about the statistical behaviour of web page changes, but rather uses an adaptive approach to maintain data on actual change rates which are in turn used as inputs for the optimization. Computational results with simulated but realistic data show that there… 

Scheduling algorithms for Web crawling

It is shown that a combination of breadth-first ordering with the largest sites first is a practical alternative since it is fast, simple to implement, and able to retrieve the best ranked pages at a rate that is closer to the optimal than other alternatives.

A probabilistic model for intelligent Web crawlers

  • Ke Hu, W. Wong
  • Computer Science
    Proceedings 27th Annual International Computer Software and Applications Conference. COMPSAC 2003
  • 2003
A simple model is outlined to predict the distribution of the search depth in a breadth-first search to reach the first Web pages relevant to a user query and this probability is defined as the crawler confidence.

A Scalable Lightweight Distributed Crawler for Crawling with Limited Resources

This paper proposes a distributed crawling concept which is designed to avoid approximations, to limit the network overhead, and to run on relatively inexpensive hardware.

The anatomy of web crawlers

A survey of different web crawler architectures, along with their comparisons, is carried out, taking into account important features such as scalability, manageability, page refresh policy, and politeness policy.

Analysis of priority and partitioning effects on web crawling performance

The main purpose of this paper is to analyze how the importance factor, multi-crawling, and partitioning affect the freshness of the web page repository of a typical search engine.

Distributed High-performance Web Crawlers : A Survey of the State of the Art

Web Crawlers (also called Web Spiders or Robots) are programs used to download documents from the internet. Simple crawlers can be used by individuals to copy an entire web site to their hard drive

An Architecture for Efficient Web Crawling

A crawler supported by a web page classifier that uses solely a page URL to determine page relevance is proposed, which reduces the number of unnecessary pages downloaded, minimising bandwidth and making it efficient and suitable for virtual integration systems.

Web Crawling By Christopher Olston and

The fundamental challenges of web crawling are outlined and the state-of-the-art models and solutions are described, and avenues for future work are highlighted.

Crawling a country: better strategies than breadth-first for web page ordering

This article proposes several page ordering strategies that are more efficient than breadth-first search and strategies based on partial Pagerank calculations, and compares them under several metrics.

Agent-Based Approach for Web Crawling

An agent-based approach to parallel and distributed Web crawling is presented through three scenarios, and it is shown that the cloning-based mobile agents scenario outperforms the single and multiple mobile agents scenarios.



The Evolution of the Web and Implications for an Incremental Crawler

An architecture for the incremental crawler is proposed that combines the best design choices; it can significantly improve the "freshness" of the collection and bring in new pages in a more timely manner.

Efficient Crawling Through URL Ordering

How dynamic is the Web?

Rate of Change and other Metrics: a Live Study of the World Wide Web

The potential benefit of a shared proxy-caching server in a large environment is quantified by using traces that were collected at the Internet connection points for two large corporations, representing significant numbers of references.

The Anatomy of a Large-Scale Hypertextual Web Search Engine

Syntactic Clustering of the Web

Optimal Robot Scheduling for Web Search Engines

This paper studies robot scheduling policies that minimize the fractions of time pages spend out-of-date, assuming independent Poisson page-change processes, and a general distribution for the page access time $X$.

Keeping up with the changing Web

What "current" means for Web search engines and how often they must reindex the Web to keep current with its changing pages and structure are quantified.

Synchronizing a database to improve freshness

This paper studies how to refresh a local copy of an autonomous data source to keep the copy up to date, and defines two freshness metrics, change models of the underlying data, and synchronization policies.