Observed Web Robot Behavior on Decaying Web Subsites

Joan A. Smith, Frank McCown, and Michael L. Nelson. D-Lib Magazine.

We describe the observed crawling patterns of various search engines (including Google, Yahoo, and MSN) as they traverse a series of web subsites whose contents decay at predetermined rates. We plot the progress of the crawlers through the subsites, and their behavior regarding the various file types included in the web subsites. We chose decaying subsites because we were originally interested in tracking the implications of using search engine caches for digital preservation. However, some of…

Evaluation of crawling policies for a web-repository crawler

A web-repository crawler for reconstructing websites when backups are unavailable is described, and three crawling policies for website reconstruction are proposed and evaluated.

The Ethicality of Web Crawlers

A vector space model to represent crawler behavior and a set of models to measure the ethics of web crawlers based on their behaviors are proposed, showing that ethicality scores vary significantly among crawlers.

Characterization of Search Engine Caches

This paper examined the cached contents of Ask, Google, MSN and Yahoo to profile such things as overlap between index and cache, size, MIME type and "staleness" of the cached resources.

Web robot detection techniques: overview and limitations

A framework is proposed that classifies the existing detection techniques into four categories based on their underlying detection philosophy, in order to gain insight into the characteristics that make up an effective robot detection scheme.

Lazy preservation: reconstructing websites by crawling the crawlers

This work introduces "lazy preservation" -- digital preservation performed as a result of the normal operation of web crawlers and caches, especially suitable for third parties; for example, a teacher reconstructing a missing website used in previous classes.

Just-in-time recovery of missing web pages

Opal is an example of "in vivo" preservation: harnessing the collective behavior of web archives, commercial search engines, and research projects for the purpose of preservation.

Website Reconstruction using the Web Infrastructure [ Extended Abstract ]

The concept of “lazy preservation”, digital preservation performed as a result of the normal operation of the Web infrastructure (search engines and caches), is introduced, and methods for tracking resources as they move through the WI are investigated.

Using the web infrastructure to preserve web pages

This work provides an overview of the ongoing research projects that focus on using the “web infrastructure” to provide preservation capabilities for web pages and examines the overlap these approaches have with the field of information retrieval.

Integrating preservation functions into the web server

This dissertation presents research in which preservation functions are integrated into the web server itself to produce archive-ready versions of a website's resources using Sitemaps, along with a technical review of the MODOAI web server module, which acts as the preservation agent.

Evaluating Personal Archiving Strategies for Internet-based Information

Responses to a survey of people who have recovered lost websites are used to paint a fuller picture of current curatorial strategies and practices, revealing ways in which expectations of permanence and notification are violated, and situations in which benign neglect has far greater consequences for the long-term fate of important digital assets.

Crawling the Hidden Web

A generic operational model of a hidden Web crawler is introduced and how this model is realized in HiWE (Hidden Web Exposer), a prototype crawler built at Stanford is described.

Crawler-Friendly Web Servers

This paper proposes that web servers export meta-data archives describing their content, yielding significant bandwidth savings, and evaluates simple, easy-to-incorporate modifications to web servers.

Mercator: A scalable, extensible Web crawler

This paper describes Mercator, a scalable, extensible Web crawler written entirely in Java, and comments on Mercator's performance, which is found to be comparable to that of other crawlers for which performance numbers have been published.

Analyzing Web Robots and Their Impact on Caching

The analyses point out that robots cause a significant increase in the miss ratio of a server-side cache, highlighting both the need for a better understanding of the behavior of robots and the need for Web caching policies that treat robots' requests differently than human-generated requests.

Finding replicated Web collections

The case for identifying replicated documents and collections to improve web crawlers, archivers, and ranking functions used in search engines is made.

The freshness of web search engine databases

It is found that Google performs best overall, with the most pages updated on a daily basis, but only MSN is able to update all pages within a time span of less than 20 days; the other two engines have outliers that are older.

Combating Web Spam with TrustRank

Hierarchical Workload Characterization for a Busy Web Server

The behavioural characteristics that emerge from this study show different features at each level of the Web server access hierarchy and suggest effective strategies for managing resources at busy Internet Web servers.

Reconstructing Websites for the Lazy Webmaster

The concept of “lazy preservation”, digital preservation performed as a result of the normal operations of the Web infrastructure (search engines and caches), is introduced, and Warrick, a tool that automates website reconstruction from the Internet Archive, Google, MSN, and Yahoo, is presented.

Topic-sensitive PageRank

A set of PageRank vectors, biased using a set of representative topics, is proposed to capture more accurately the notion of importance with respect to a particular topic; these vectors are shown to generate more accurate rankings than a single, generic PageRank vector.
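The biasing idea behind this last entry can be sketched as personalized PageRank via power iteration: instead of teleporting uniformly to all pages, random-jump mass returns only to a topic-representative page set. This is a minimal illustration under assumed parameters (the graph, damping factor, iteration count, and helper name `topic_pagerank` are illustrative choices, not details taken from the paper):

```python
# Sketch of topic-biased (personalized) PageRank via power iteration.
# All names and parameters here are illustrative assumptions.

def topic_pagerank(links, topic_pages, damping=0.85, iters=100):
    """links: dict mapping each node to a list of outgoing neighbors.
    topic_pages: the teleport set; random-jump mass returns only to
    these pages, biasing the ranking toward that topic."""
    nodes = list(links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    teleport = {v: (1.0 / len(topic_pages) if v in topic_pages else 0.0)
                for v in nodes}
    for _ in range(iters):
        # Start each node with its share of the teleport (topic) mass.
        new = {v: (1.0 - damping) * teleport[v] for v in nodes}
        for v in nodes:
            out = links[v]
            if out:
                # Distribute this node's rank evenly over its out-links.
                share = damping * rank[v] / len(out)
                for w in out:
                    new[w] += share
            else:
                # Dangling node: redistribute its mass via the topic set.
                for w in nodes:
                    new[w] += damping * rank[v] * teleport[w]
        rank = new
    return rank
```

With a uniform teleport vector this reduces to ordinary PageRank; varying `topic_pages` across several representative topics yields the set of biased rank vectors the abstract describes.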