Effective page refresh policies for Web crawlers

@article{Cho2003EffectivePR,
  title={Effective page refresh policies for Web crawlers},
  author={Junghoo Cho and Hector Garcia-Molina},
  journal={ACM Trans. Database Syst.},
  year={2003},
  volume={28},
  pages={390-426}
}
In this article, we study how to keep local copies of remote data sources "fresh" when the source data is updated autonomously and independently. In particular, we study the problem of Web crawlers that maintain local copies of remote Web pages for Web search engines. In this context, remote data sources (Websites) do not notify the copies (Web crawlers) of new changes, so we need to periodically poll the sources to keep the copies up-to-date. Since polling the sources takes… 
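
For orientation, the freshness objective the article optimizes can be sketched under the standard assumptions of a Poisson change model (rate \(\lambda\) per page) and a fixed refresh interval \(I\); this is a simplified illustration, not a quotation of the article's full derivation. Taking \(F(p;t)=1\) if the local copy of page \(p\) matches the live page at time \(t\) and \(0\) otherwise, the time-averaged expected freshness of a page refreshed every \(I\) time units is
\[
\mathbb{E}[\bar{F}] \;=\; \frac{1}{I}\int_{0}^{I} e^{-\lambda t}\, dt \;=\; \frac{1 - e^{-\lambda I}}{\lambda I},
\]
which makes explicit how rarely refreshed, frequently changing pages drag down the average freshness of the collection.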

Effective Page Refresh Policies for Green Web Crawling

The proposed work is motivated mainly by the need to manage updated Web data and by the problem of Web crawlers that maintain local copies of remote Web pages for Web search engines.

Efficiently Detecting Webpage Updates Using Samples

This paper proposes a set of sampling policies with various downloading granularities, taking into account the link structure, the directory structure, and content-based features, and investigates the update history and the popularity of the webpages to adaptively model the download probability.

Change Rate Estimation and Optimal Freshness in Web Page Crawling

This work provides two novel schemes for online estimation of page change rates, proves that both converge, and derives their convergence rates.

A Hybrid Revisit Policy For Web Search

A hybrid approach is proposed that identifies important pages at an early stage of the crawl, and the crawler re-visits these important pages with higher priority.

Online Algorithms for Estimating Change Rates of Web Pages

Coherence-Oriented Crawling and Navigation Using Patterns for Web Archives

A novel navigation approach is introduced that enables users to browse the most coherent page versions at a given query time; it is based on patterns that model how the importance of page changes evolves over a period of time.

Models and methods for web archive crawling

This thesis presents a model for assessing the data quality in Web archives as well as a family of crawling strategies yielding high-quality captures, and proposes visualization techniques for exploring the quality of the resulting Web archives.

Clustering-based incremental web crawling

A crawling algorithm is designed that clusters Web pages based on features correlated with their change frequencies, as obtained by examining past history; this algorithm outperforms existing algorithms by improving the quality of the user experience for those who query the search engine. A much-simplified sketch of the clustering idea follows below.
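
As a much-simplified sketch of that idea (the paper clusters on richer features correlated with change frequency; here pages are bucketed only by their observed past change frequency, and all names below are illustrative assumptions):

def bucket_pages_by_change_rate(history, num_clusters=3, base_interval_days=1):
    """Group pages into clusters by observed change frequency and assign a
    revisit interval (in days) per cluster: fast-changing clusters are
    revisited more often. `history` maps URL -> (changes_seen, visits)."""
    rates = {url: changes / max(visits, 1)
             for url, (changes, visits) in history.items()}
    ordered = sorted(rates, key=rates.get, reverse=True)   # fastest-changing first
    size = max(1, -(-len(ordered) // num_clusters))        # ceiling division
    plan = {}
    for i in range(num_clusters):
        chunk = ordered[i * size:(i + 1) * size]
        revisit_days = base_interval_days * (2 ** i)       # 1, 2, 4, ... days
        for url in chunk:
            plan[url] = revisit_days
    return plan

# Example: pages with (changes observed, visits made) in past crawls.
history = {
    "news.example/home": (45, 50),
    "blog.example/post": (5, 50),
    "docs.example/api": (1, 50),
}
print(bucket_pages_by_change_rate(history))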

Keeping a Search Engine Index Fresh: Risk and Optimality in Estimating Refresh Rates for Web Pages

A Poisson process model for the number of state changes of a page is considered, where a crawler samples the page at some known (but variable) time interval and observes whether or not the page has changed during that interval.
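
Under that sampling model, the probability of seeing a change over a visit gap of length \(t_i\) is \(1 - e^{-\lambda t_i}\); writing \(c_i \in \{0,1\}\) for the observed change indicator, the log-likelihood whose maximizer estimates \(\lambda\) is (a standard formulation sketched here, not quoted from the paper)
\[
\ell(\lambda) \;=\; \sum_{i=1}^{n}\Bigl[\, c_i \log\bigl(1 - e^{-\lambda t_i}\bigr) \;-\; (1 - c_i)\,\lambda t_i \,\Bigr].
\]
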
...

References

Design and implementation of a high-performance distributed Web crawler

This paper describes the design and implementation of a distributed Web crawler that runs on a network of workstations, scales to several hundred pages per second, is resilient against system crashes and other events, and can be adapted to various crawling applications.

The Evolution of the Web and Implications for an Incremental Crawler

An architecture for an incremental crawler is proposed that combines the best design choices; it can significantly improve the "freshness" of the collection and bring in new pages in a more timely manner.

Evaluating topic-driven web crawlers

This work proposes three different methods to evaluate crawling strategies and applies the proposed metrics to compare three topic-driven crawling algorithms based on similarity ranking, link analysis, and adaptive agents.

Rate of Change and other Metrics: a Live Study of the World Wide Web

The potential benefit of a shared proxy-caching server in a large environment is quantified by using traces that were collected at the Internet connection points for two large corporations, representing significant numbers of references.

Focused Crawling: A New Approach to Topic-Specific Web Resource Discovery

Focused Crawling Using Context Graphs

A focused crawling algorithm is presented that builds a model for the context within which topically relevant pages occur on the web that can capture typical link hierarchies within which valuable pages occur, as well as model content on documents that frequently cooccur with relevant pages.

Efficient Crawling Through URL Ordering

Estimating frequency of change

The case is made for estimating the change frequency of data to improve Web crawlers and Web caches and to help data mining, and several "frequency estimators" are developed.
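
As a hedged sketch of one such estimator for the fixed-interval case (the form below is the commonly cited bias-reduced estimator for a Poisson change rate; treat the exact constants as an assumption rather than a quotation of the paper):

import math

def naive_change_rate(detected_changes, visits, interval):
    # Detected changes per unit time; underestimates the true rate because
    # several changes between two visits are observed as a single change.
    return detected_changes / (visits * interval)

def improved_change_rate(detected_changes, visits, interval):
    # Bias-reduced estimate of a Poisson change rate from change/no-change
    # observations taken every `interval` time units.
    n, x = visits, detected_changes
    return -math.log((n - x + 0.5) / (n + 0.5)) / interval

# Example: 50 daily visits, with a change detected on 10 of them.
print(naive_change_rate(10, 50, 1.0))     # 0.20 changes/day
print(improved_change_rate(10, 50, 1.0))  # ~0.22 changes/day, correcting the bias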

Parallel crawlers

This paper proposes multiple architectures for a parallel crawler, identifies fundamental issues related to parallel crawling, proposes metrics to evaluate a parallel crawler, and compares the proposed architectures using 40 million pages collected from the Web.