Effective page refresh policies for Web crawlers

@article{Cho2003EffectivePR,
  title={Effective page refresh policies for Web crawlers},
  author={Junghoo Cho and Hector Garcia-Molina},
  journal={ACM Trans. Database Syst.},
  year={2003},
  volume={28},
  pages={390-426}
}
In this article, we study how we can maintain local copies of remote data sources "fresh," when the source data is updated autonomously and independently. In particular, we study the problem of Web crawlers that maintain local copies of remote Web pages for Web search engines. In this context, remote data sources (Websites) do not notify the copies (Web crawlers) of new changes, so we need to periodically poll the sources to maintain the copies up-to-date. Since polling the sources takes…
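As a rough sketch of the trade-off described in the abstract (and not a reproduction of the article's policies), the snippet below adopts the standard modeling assumption in this line of work: each page changes according to a Poisson process with rate λ and is re-downloaded every I time units, giving a time-averaged freshness of (1 − e^(−λI))/(λI). It then compares spending a fixed crawl budget uniformly across pages versus in proportion to their change rates; all numbers are hypothetical.

```python
import math

def expected_freshness(change_rate: float, refresh_interval: float) -> float:
    """Time-averaged probability that the local copy is up to date, assuming
    the page changes as a Poisson process with `change_rate` changes per time
    unit and is re-downloaded every `refresh_interval` time units."""
    r = change_rate * refresh_interval
    return 1.0 if r == 0 else (1.0 - math.exp(-r)) / r

# Hypothetical example: a budget of one download per day shared by two pages,
# one of which changes nine times more often than the other.
rates = [0.1, 0.9]   # changes per day
budget = 1.0         # total downloads per day

# Uniform policy: both pages are refreshed at the same frequency.
uniform = [expected_freshness(lam, len(rates) / budget) for lam in rates]

# Proportional policy: refresh frequency proportional to the change rate.
total_rate = sum(rates)
proportional = [expected_freshness(lam, total_rate / (budget * lam)) for lam in rates]

print("uniform     :", sum(uniform) / len(uniform))            # ~0.68
print("proportional:", sum(proportional) / len(proportional))  # ~0.63
```

In this toy setting the uniform policy comes out ahead, which is consistent with the article's well-known observation that refreshing pages strictly in proportion to their change rates is not the best use of a fixed budget.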
Citations

Effective Page Refresh Policies for Green Web Crawling
On the web, a very large number of requests go to servers. This large volume of HTTP requests increases the energy consumption and carbon footprint of the web servers, and for that computational…
Efficiently Detecting Webpage Updates Using Samples
This paper proposes a set of sampling policies with various downloading granularities, taking into account the link structure, the directory structure, and the content-based features, and investigates the update history and the popularity of the webpages to adaptively model the download probability.
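A minimal sketch of the sampling idea (illustrative only; the sample size, threshold, and the `fetch`/`stored_digests` helpers are assumptions, not the paper's actual policies): download a small random sample of a site's pages, compare content digests against the stored copies, and re-crawl the whole site only if enough sampled pages have changed.

```python
import hashlib
import random

def digest(content: bytes) -> str:
    """Stable fingerprint of a page body."""
    return hashlib.sha256(content).hexdigest()

def should_recrawl_site(urls, fetch, stored_digests, sample_size=20, threshold=0.3):
    """Decide whether to re-crawl a site based on a random sample of its pages.

    `urls` is a list of the site's known URLs, `fetch(url)` returns the current
    page body as bytes, and `stored_digests[url]` holds the digest recorded at
    the previous crawl; all three are placeholders supplied by the crawler.
    """
    sample = random.sample(urls, min(sample_size, len(urls)))
    changed = sum(1 for url in sample if digest(fetch(url)) != stored_digests.get(url))
    return changed / len(sample) > threshold
```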
A Hybrid Revisit Policy For Web Search
A crawler is a program that retrieves and stores pages from the Web, commonly for a Web search engine. A crawler often has to download hundreds of millions of pages in a short period of time and has…
Change Rate Estimation and Optimal Freshness in Web Page Crawling
This work provides two novel schemes for online estimation of page change rates, proves convergence for both, and derives their convergence rates.
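For the change-rate estimation problem these schemes address, a simple baseline (not the estimators proposed in the paper) is the maximum-likelihood estimate from periodic polling: if a page is checked every I time units and a change is detected in k of n checks, then under a Poisson change model P(change detected) = 1 − e^(−λI), so λ̂ = −ln(1 − k/n)/I. The sketch below implements this baseline; note that it breaks down when every check sees a change, one of the practical issues online estimators are designed to avoid.

```python
import math

def estimate_change_rate(k: int, n: int, interval: float) -> float:
    """Maximum-likelihood Poisson change-rate estimate from n equally spaced
    checks, k of which detected a change since the previous check."""
    if n <= 0 or not 0 <= k <= n:
        raise ValueError("need 0 <= k <= n with n > 0")
    if k == n:
        raise ValueError("every check saw a change; the MLE diverges")
    return -math.log(1.0 - k / n) / interval

# Example: a page checked once a day for 30 days, with changes seen on 12 days.
print(estimate_change_rate(k=12, n=30, interval=1.0))  # ~0.51 changes per day
```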
A Hybrid Revisit Policy For Web Search
A hybrid approach is proposed that can identify important pages at an early stage of a crawl, so that the crawler revisits these important pages with higher priority.
Online Algorithms for Estimating Change Rates of Web Pages
This work provides three novel schemes for online estimation of page change rates, including the first convergence-type result for a stochastic approximation algorithm with momentum, and presents numerical experiments comparing the performance of the proposed estimators with existing ones.
Coherence-Oriented Crawling and Navigation Using Patterns for Web Archives
A novel navigation approach is introduced that enables users to browse the most coherent page versions at a given query time; it is based on patterns, which model how the importance of page changes behaves over a period of time.
Models and methods for web archive crawling
This thesis presents a model for assessing the data quality in Web archives as well as a family of crawling strategies yielding high-quality captures, and proposes visualization techniques for exploring the quality of the resulting Web archives.
Clustering-based incremental web crawling
A crawling algorithm is designed that clusters Web pages based on features that correlate with their change frequencies, obtained by examining past history; this algorithm outperforms existing algorithms in that it can improve the quality of the user experience for those who query the search engine.
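As a rough illustration of the clustering idea (a sketch under assumed features, not the cited algorithm), pages can be grouped by simple change-history statistics and each cluster assigned a common revisit interval:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical per-page change-history features gathered by the crawler:
# [observed change frequency (changes/day), days since last observed change]
features = np.array([
    [0.90, 0.5], [0.80, 1.0], [0.75, 0.8],   # frequently changing pages
    [0.10, 6.0], [0.05, 9.0], [0.08, 7.5],   # rarely changing pages
    [0.40, 2.0], [0.35, 3.0],                # in-between pages
])

# Group pages with similar change behaviour.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(features)

# Revisit more frequently changing clusters more often.
for cluster in range(3):
    mean_rate = features[labels == cluster, 0].mean()
    print(f"cluster {cluster}: revisit roughly every {1.0 / mean_rate:.1f} days")
```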
Keeping a Search Engine Index Fresh: Risk and optimality in estimating refresh rates for web pages
Search engines strive to maintain a “current” repository of all web pages on the internet to index for user queries. However, refreshing all web pages all the time is costly and inefficient: many…

References

Showing 1-10 of 47 references
Crawling the web : discovery and maintenance of large-scale web data
This dissertation studies how to build an effective Web crawler that can retrieve “high quality” pages quickly, while maintaining the retrieved pages “fresh,” and explores how to parallelize a crawling process to maximize the download rate while minimizing the overhead from parallelization.
Design and implementation of a high-performance distributed Web crawler
This paper describes the design and implementation of a distributed Web crawler that runs on a network of workstations that scales to several hundred pages per second, is resilient against system crashes and other events, and can be adapted to various crawling applications.
The Evolution of the Web and Implications for an Incremental Crawler
An architecture for the incremental crawler is proposed that combines the best design choices; it can improve the “freshness” of the collection significantly and bring in new pages in a more timely manner.
Towards a Better Understanding of Web Resources and Server Responses for Improved Caching
Results from the work indicate that there is potential to reuse more cached resources than is currently being realized due to inaccurate and nonexistent cache directives, and that separating out the dynamic portions of a page into their own resources allows relatively static portions to be cached.
Evaluating topic-driven web crawlers
This work proposes three different methods to evaluate crawling strategies and applies the proposed metrics to compare three topic-driven crawling algorithms based on similarity ranking, link analysis, and adaptive agents.
Effective page refresh policies for Web crawlers
In this article, we study how we can maintain local copies of remote data sources "fresh," when the source data is updated autonomously and independently. In particular, we study the problem of Web...
Rate of Change and other Metrics: a Live Study of the World Wide Web
The potential benefit of a shared proxy-caching server in a large environment is quantified by using traces that were collected at the Internet connection points for two large corporations, representing significant numbers of references.
Focused Crawling: A New Approach to Topic-Specific Web Resource Discovery
A new hypertext resource discovery system called a Focused Crawler is presented; it is robust against large perturbations in the starting set of URLs, and capable of exploring out and discovering valuable resources that are dozens of links away from the start set, while carefully pruning the millions of pages that may lie within this same radius.
Focused Crawling Using Context Graphs
A focused crawling algorithm is presented that builds a model for the context within which topically relevant pages occur on the web; the model can capture typical link hierarchies within which valuable pages occur, as well as model content on documents that frequently co-occur with relevant pages.
Efficient Crawling Through URL Ordering
This paper studies in what order a crawler should visit the URLs it has seen, in order to obtain more "important" pages first, and shows that a crawler with a good ordering scheme can obtain important pages significantly faster than one without.
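As a small illustration of importance-ordered crawling (the `Frontier` class and scores below are hypothetical; the paper itself evaluates ordering metrics such as backlink counts and PageRank-style scores), a crawler can keep its frontier in a priority queue keyed by estimated importance:

```python
import heapq

class Frontier:
    """Crawl frontier that always yields the URL with the highest
    estimated importance score first."""

    def __init__(self):
        self._heap = []   # min-heap of (-score, url)
        self._seen = set()

    def push(self, url: str, score: float) -> None:
        if url not in self._seen:
            self._seen.add(url)
            heapq.heappush(self._heap, (-score, url))

    def pop(self) -> str:
        return heapq.heappop(self._heap)[1]

frontier = Frontier()
frontier.push("https://example.com/a", score=0.2)
frontier.push("https://example.com/b", score=0.9)
print(frontier.pop())  # the higher-scored URL is crawled first
```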