Crawler-Friendly Web Servers

@article{Brandman2000CrawlerFriendlyWS,
  title={Crawler-Friendly Web Servers},
  author={Onn Brandman and Junghoo Cho and Hector Garcia-Molina and Narayanan Shivakumar},
  journal={SIGMETRICS Perform. Evaluation Rev.},
  year={2000},
  volume={28},
  pages={9-14}
}
In this paper we study how to make web servers (e.g., Apache) more crawler friendly. Current web servers offer the same interface to crawlers and regular web surfers, even though crawlers and surfers have very different performance requirements. We evaluate simple, easy-to-incorporate modifications to web servers that yield significant bandwidth savings. Specifically, we propose that web servers export meta-data archives describing their content.
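
Such a meta-data archive can be generated offline from the server's document root. The sketch below illustrates the idea in Python; the file name metadata.txt, the tab-separated URL/last-modified/size layout, and the example host are assumptions for illustration, not the format defined in the paper.

    import os
    import time

    DOC_ROOT = "/var/www/html"          # assumed Apache document root
    BASE_URL = "http://example.com"     # hypothetical site
    ARCHIVE = os.path.join(DOC_ROOT, "metadata.txt")

    def build_metadata_archive():
        """Write one record per exported file: URL, last-modified (epoch seconds), size in bytes."""
        with open(ARCHIVE, "w") as out:
            for dirpath, _dirnames, filenames in os.walk(DOC_ROOT):
                for name in filenames:
                    path = os.path.join(dirpath, name)
                    if path == ARCHIVE:
                        continue  # do not list the archive itself
                    stat = os.stat(path)
                    url = BASE_URL + "/" + os.path.relpath(path, DOC_ROOT)
                    out.write("%s\t%d\t%d\n" % (url, int(stat.st_mtime), stat.st_size))

    if __name__ == "__main__":
        build_metadata_archive()
        print("archive written at", time.ctime())

A crawler that fetches this archive first can then request only the URLs whose last-modified times are newer than its stored copies, which is where the bandwidth savings come from.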

Crawlets: Agents for High Performance Web Search Engines

An approach to conventional crawling in which a search engine uploads simple agents, called crawlets, to web sites; it requires no changes to web servers, only the installation of a few (active) web pages at host sites.
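
The crawlet idea can be approximated by a small active page installed at the host site that accepts a list of local paths, reads them from disk, and returns them as a single compressed batch, so the engine pays one request instead of one per page. A minimal sketch under those assumptions (names and interface are hypothetical, not the paper's implementation):

    import io
    import os
    import tarfile

    DOC_ROOT = "/var/www/html"  # assumed document root of the host site

    def crawlet_batch(paths):
        """Pack the requested pages into one gzip-compressed tar archive held in memory."""
        buf = io.BytesIO()
        with tarfile.open(fileobj=buf, mode="w:gz") as archive:
            for rel in paths:
                full = os.path.join(DOC_ROOT, rel.lstrip("/"))
                if os.path.isfile(full):
                    archive.add(full, arcname=rel)
        return buf.getvalue()  # bytes to return as one HTTP response body

    # Example: the search engine asks for three pages in a single request.
    payload = crawlet_batch(["/index.html", "/news/today.html", "/about.html"])
    print(len(payload), "bytes for the whole batch")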

Scheduling algorithms for Web crawling

It is shown that a combination of breadth-first ordering with the largest sites first is a practical alternative since it is fast, simple to implement, and able to retrieve the best ranked pages at a rate that is closer to the optimal than other alternatives.
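
The "breadth-first with the largest sites first" policy can be sketched as follows: order the known sites by page count, then traverse each site's pages in breadth-first order. The data structures and the discover_links helper below are illustrative assumptions, not the authors' implementation.

    from collections import deque

    def schedule(sites):
        """sites: dict mapping host name -> (page_count, list of seed URLs).
        Yields URLs, largest sites first, breadth-first within each site."""
        ordered = sorted(sites.items(), key=lambda kv: kv[1][0], reverse=True)
        for host, (_count, seeds) in ordered:
            queue = deque(seeds)
            seen = set(seeds)
            while queue:
                url = queue.popleft()
                yield url
                for link in discover_links(url):  # hypothetical fetch-and-parse step
                    if link.startswith("http://" + host) and link not in seen:
                        seen.add(link)
                        queue.append(link)

    def discover_links(url):
        """Placeholder: download `url` and extract its out-links."""
        return []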

Implementation and Evaluation of an Architecture for Web Search Engine Freshness

This work demonstrates that implementing FreshFlow incurs a low penalty at an Apache web server, and uses trace-driven simulations to show that the algorithm used in FreshFlow performs much better than naive alternatives.

Cooperation schemes between a Web server and a Web search engine

  • C. Castillo
  • 2003
This work explores and compares several schemes by which a Web server can cooperate with a search engine to keep the search engine's repository fresh; without such cooperation, polling is the only method for detecting changes.

Design of an Efficient Migrating Crawler based on Sitemaps

The information provided by the Sitemap protocol is used to crawl high-quality web pages; it helps the crawler visit pages based on their change frequency and download only updated pages, thereby reducing unnecessary network traffic.
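
Under the Sitemap protocol, each <url> entry may carry a <lastmod> (and optionally <changefreq>) element, so a crawler can skip any URL whose <lastmod> is not newer than the copy it already holds. A rough sketch of that filter (simplified date handling; the helper name and dictionary interface are assumptions):

    import urllib.request
    import xml.etree.ElementTree as ET

    NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

    def urls_to_refetch(sitemap_url, last_crawled):
        """last_crawled: dict URL -> ISO date of the copy already held.
        Returns URLs whose <lastmod> is newer than the stored copy (or unknown)."""
        tree = ET.parse(urllib.request.urlopen(sitemap_url))
        stale = []
        for entry in tree.findall("sm:url", NS):
            loc = entry.findtext("sm:loc", namespaces=NS)
            lastmod = entry.findtext("sm:lastmod", namespaces=NS)
            if loc is None:
                continue
            # ISO-8601 date strings compare correctly at this granularity.
            if lastmod is None or loc not in last_crawled or lastmod > last_crawled[loc]:
                stale.append(loc)
        return stale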

A Cooperative Approach to Web Crawler URL Ordering

A novel URL ordering system that relies on a cooperative approach between crawlers and web servers based on file system and Web log information, which is able to retrieve high quality pages earlier in the crawl while avoiding requests for pages that are unchanged or no longer available.
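
One plausible use of Web log information for URL ordering (a sketch of the general idea, not the specific system described in the paper) is for the server to rank its URLs by how often surfers request them, so the crawler can fetch popular pages first:

    import re
    from collections import Counter

    # Request path inside a common/combined-format Apache access log line.
    LOG_LINE = re.compile(r'"(?:GET|HEAD) (\S+) HTTP/[\d.]+"')

    def order_urls_by_popularity(access_log_path):
        """Rank local URLs by request frequency observed in the server's access log."""
        hits = Counter()
        with open(access_log_path) as log:
            for line in log:
                match = LOG_LINE.search(line)
                if match:
                    hits[match.group(1)] += 1
        return [url for url, _count in hits.most_common()]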

A Co-operative Web Services Paradigm for Supporting Crawlers

This work proposes a new Web services paradigm for Website/crawler interaction that is co-operative and exploits the information present in the Web logs and file system, and presents experimental results demonstrating that this approach provides bandwidth savings, more complete Web page collections, and notification of deleted pages.
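
One way a server's file system can feed such a co-operative scheme is a "what changed since T" service: the server walks its document root, compares modification times against the timestamp the crawler supplies, and also reports which of the crawler's known paths no longer exist. The interface below is a hypothetical sketch, not the paper's Web service definition.

    import os

    DOC_ROOT = "/var/www/html"  # assumed document root

    def changed_since(timestamp, known_paths):
        """timestamp: epoch seconds of the crawler's last visit.
        known_paths: paths the crawler currently holds for this site.
        Returns (updated_paths, deleted_paths)."""
        existing = set()
        updated = []
        for dirpath, _dirs, files in os.walk(DOC_ROOT):
            for name in files:
                full = os.path.join(dirpath, name)
                rel = "/" + os.path.relpath(full, DOC_ROOT)
                existing.add(rel)
                if os.stat(full).st_mtime > timestamp:
                    updated.append(rel)
        deleted = [p for p in known_paths if p not in existing]
        return updated, deleted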

An Efficient Technique to Reduce Network Load during Web Crawling

The mobile crawlers filter out pages that have not been modified since the last crawl before sending them to the search engine for indexing; to achieve this, the old web page is compared with the new web page.
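
Comparing the old and new copies of a page is cheapest with a content digest: the crawler keeps one hash per URL from the previous crawl and ships back only pages whose hash has changed. A small sketch of that filter (the hash-based comparison and the names used here are assumptions, not necessarily the paper's exact technique):

    import hashlib

    def pages_to_ship(new_pages, old_digests):
        """new_pages: dict URL -> freshly fetched page content (bytes).
        old_digests: dict URL -> SHA-256 hex digest from the previous crawl.
        Returns only the pages whose content actually changed."""
        changed = {}
        for url, content in new_pages.items():
            digest = hashlib.sha256(content).hexdigest()
            if old_digests.get(url) != digest:
                changed[url] = content
                old_digests[url] = digest  # remember for the next crawl
        return changed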

Designing clustering-based web crawling policies for search engine crawlers

The results demonstrate that the clustering algorithm effectively clusters the pages with similar change patterns, and the solution significantly outperforms the existing methods in that it can detect more changed webpages and improve the quality of the user experience for those who query the search engine.
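
One simple way to realize a clustering-based revisit policy (illustrative only; not necessarily the clustering algorithm evaluated in the paper) is to summarize each page's change history, group pages with similar change rates, and revisit the most volatile group most often:

    def cluster_by_change_rate(history, num_clusters=3):
        """history: dict URL -> list of 0/1 flags, one per past crawl (1 = page had changed).
        Returns clusters of URLs, ordered from most to least frequently changing."""
        rates = {url: sum(flags) / float(len(flags)) for url, flags in history.items() if flags}
        ranked = sorted(rates, key=rates.get, reverse=True)
        size = max(1, len(ranked) // num_clusters)
        return [ranked[i:i + size] for i in range(0, len(ranked), size)]

    # Pages in the first cluster would be recrawled most often.
    clusters = cluster_by_change_rate({
        "/news.html":  [1, 1, 1, 0],
        "/blog.html":  [1, 0, 1, 0],
        "/about.html": [0, 0, 0, 0],
    })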

Crawling a country: better strategies than breadth-first for web page ordering

This article proposes several page ordering strategies that are more efficient than breadth-first search, as well as strategies based on partial PageRank calculations, all compared under several metrics.
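
A "partial PageRank" ordering can be sketched by running a few power-iteration steps over only the portion of the web graph crawled so far, then prioritizing frontier URLs by the rank flowing into them from already-downloaded pages. This is a simplified illustration of the general technique, not the article's exact strategy.

    def partial_pagerank(out_links, iterations=3, damping=0.85):
        """out_links: dict page -> list of pages it links to (crawled portion only)."""
        pages = list(out_links)
        rank = {p: 1.0 / len(pages) for p in pages}
        for _ in range(iterations):
            new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
            for p, targets in out_links.items():
                if not targets:
                    continue
                share = damping * rank[p] / len(targets)
                for t in targets:
                    if t in new_rank:
                        new_rank[t] += share
            rank = new_rank
        return rank

    def order_frontier(frontier_inlinks, rank):
        """frontier_inlinks: dict uncrawled URL -> list of crawled pages linking to it."""
        score = {u: sum(rank.get(p, 0.0) for p in srcs) for u, srcs in frontier_inlinks.items()}
        return sorted(score, key=score.get, reverse=True)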
...

References

Improving HTTP Latency

Accessibility of information on the web

As the web becomes a major communications medium, the data on it must be made more accessible, and search engines play a central role in providing that access.

Generating representative Web workloads for network and server performance evaluation

This paper applies a number of observations of Web server usage to create a realistic Web workload generation tool that mimics a set of real users accessing a server, and addresses the technical challenges of satisfying this large set of simultaneous constraints on the properties of the reference stream.

Synchronizing a database to improve freshness

This paper studies how to refresh a local copy of an autonomous data source to keep the copy up-to-date, and defines two freshness metrics, change models of the underlying data, and synchronization policies.
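
For a local copy S = {e_1, ..., e_N}, the two freshness metrics can be restated roughly as follows (a paraphrase, not the paper's exact notation):

    F(e_i; t) = \begin{cases} 1 & \text{if } e_i \text{ is up-to-date at time } t \\ 0 & \text{otherwise} \end{cases}
    \qquad
    F(S; t) = \frac{1}{N} \sum_{i=1}^{N} F(e_i; t)

    A(e_i; t) = \begin{cases} 0 & \text{if } e_i \text{ is up-to-date at time } t \\ t - t_{\mathrm{mod}}(e_i) & \text{otherwise} \end{cases}
    \qquad
    A(S; t) = \frac{1}{N} \sum_{i=1}^{N} A(e_i; t)

Here t_mod(e_i) denotes the time the source element corresponding to e_i was last modified, so F measures how much of the copy is current while A measures how stale the out-of-date part is.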

STARTS: Stanford Protocol Proposal for Internet Retrieval and Search

The Digital Library project at Stanford has coordinated among search-engine vendors and other key players to reach informal agreements for unifying basic search and retrieval interactions across three areas; this is the final writeup of the informal "standards" effort.

Optimal Robot Scheduling for Web Search Engines

This paper studies robot scheduling policies that minimize the fractions of time pages spend out-of-date, assuming independent Poisson page-change processes, and a general distribution for the page access time $X$.
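
For intuition, a standard consequence of the Poisson assumption (not a result quoted from the paper): a page whose changes arrive as a Poisson process with rate lambda has changed within time t of the last visit with probability 1 - e^{-\lambda t}, so a page revisited every T time units is out-of-date for an expected fraction of time

    \frac{1}{T} \int_0^T \left(1 - e^{-\lambda t}\right) \, dt \;=\; 1 - \frac{1 - e^{-\lambda T}}{\lambda T}.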

An Introduction to the Resource Description Framework

The Resource Description Framework (RDF) is an infrastructure that enables the encoding, exchange and reuse of structured metadata. RDF is an application of XML that imposes needed structural constraints to provide unambiguous methods of expressing semantics.

Harvest user's manual

  • 1996

Accessible at http://harvest.transarc.com/afs/transarc.com/public/trg/Harvest/user-manual