Observed Web Robot Behavior on Decaying Web Subsites

@article{Smith2006ObservedWR,
  title={Observed Web Robot Behavior on Decaying Web Subsites},
  author={Joan A. Smith and Frank McCown and Michael L. Nelson},
  journal={D-Lib Magazine},
  year={2006},
  volume={12}
}
We describe the observed crawling patterns of various search engines (including Google, Yahoo and MSN) as they traverse a series of web subsites whose contents decay at predetermined rates. We plot the progress of the crawlers through the subsites, and their behaviors regarding the various file types included in the web subsites. We chose decaying subsites because we were originally interested in tracking the implication of using search engine caches for digital preservation. However, some of… 
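The study works from server access logs: each engine's crawler is identified and its progress through the decaying subsites is tracked by resource and file type. As a rough illustration only (not the paper's actual tooling), the following minimal Python sketch shows that kind of log analysis, assuming Combined Log Format access logs; the user-agent signatures, the `access.log` filename, and the grouping by file extension are illustrative assumptions.

```python
import re
from collections import defaultdict
from datetime import datetime

# Illustrative user-agent signatures for the engines named in the abstract;
# the paper's actual crawler-identification rules are not reproduced here.
CRAWLER_SIGNATURES = {
    "Googlebot": "Google",
    "Yahoo! Slurp": "Yahoo",
    "msnbot": "MSN",
}

# Combined Log Format, e.g.:
# 1.2.3.4 - - [10/Feb/2006:13:55:36 -0500] "GET /subsite/page3.html HTTP/1.1" 200 2326 "-" "Googlebot/2.1"
LOG_RE = re.compile(
    r'\S+ \S+ \S+ \[(?P<ts>[^\]]+)\] "(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

def crawler_visits(log_lines):
    """Yield (crawler, timestamp, path, status) for requests from known crawlers."""
    for line in log_lines:
        m = LOG_RE.match(line)
        if not m:
            continue
        agent = m.group("agent")
        for signature, name in CRAWLER_SIGNATURES.items():
            if signature in agent:
                ts = datetime.strptime(m.group("ts"), "%d/%b/%Y:%H:%M:%S %z")
                yield name, ts, m.group("path"), int(m.group("status"))
                break

def coverage_by_crawler(log_lines):
    """Count distinct resources each crawler requested, grouped by file extension."""
    seen = defaultdict(set)
    for name, _ts, path, _status in crawler_visits(log_lines):
        ext = path.rsplit(".", 1)[-1].lower() if "." in path else "(none)"
        seen[(name, ext)].add(path)
    return {key: len(paths) for key, paths in seen.items()}

if __name__ == "__main__":
    with open("access.log", encoding="utf-8") as fh:
        for (name, ext), count in sorted(coverage_by_crawler(fh).items()):
            print(f"{name:<8} {ext:<6} {count:4d} distinct resources")
```

Plotting these counts per crawler over the observation period would give the kind of coverage-over-time view the abstract describes.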

Evaluation of crawling policies for a web-repository crawler

TLDR
A web-repository crawler for reconstructing websites when backups are unavailable is presented, and three crawling policies for performing such reconstructions are proposed and evaluated.

The Ethicality of Web Crawlers

TLDR
A vector space model to represent crawler behavior and a set of models to measure the ethics of web crawlers based on their behaviors are proposed, showing that ethicality scores vary significantly among crawlers.

Characterization of Search Engine Caches

TLDR
This paper examined the cached contents of Ask, Google, MSN and Yahoo to profile such things as overlap between index and cache, size, MIME type and "staleness" of the cached resources.

Web robot detection techniques: overview and limitations

TLDR
A framework to classify the existing detection techniques into four categories based on their underlying detection philosophy is proposed, to gain insights into those characteristics that make up an effective robot detection scheme.

Lazy preservation: reconstructing websites by crawling the crawlers

TLDR
This work introduces "lazy preservation" -- digital preservation performed as a result of the normal operation of web crawlers and caches, especially suitable for third parties; for example, a teacher reconstructing a missing website used in previous classes.

Just-in-time recovery of missing web pages

TLDR
Opal is an example of "in vivo" preservation: harnessing the collective behavior of web archives, commercial search engines, and research projects for the purpose of preservation.

Website Reconstruction using the Web Infrastructure [Extended Abstract]

TLDR
The concept of “lazy preservation” -- digital preservation performed as a result of the normal operations of the Web infrastructure (search engines and caches) -- is introduced, and methods for tracking resources as they move through the WI are investigated.

Using the web infrastructure to preserve web pages

TLDR
This work provides an overview of the ongoing research projects that focus on using the “web infrastructure” to provide preservation capabilities for web pages and examines the overlap these approaches have with the field of information retrieval.

Integrating preservation functions into the web server

TLDR
This dissertation presents research in which preservation functions have been integrated into the web server itself to produce archive-ready versions of the website's resources, using Sitemaps, and a technical review of the MODOAI web server module which acts as the preservation agent.

Evaluating Personal Archiving Strategies for Internet-based Information

TLDR
Responses to a survey of people who have recovered lost websites are used to paint a fuller picture of current curatorial strategies and practices, revealing ways in which expectations of permanence and notification are violated, and situations in which benign neglect has far greater consequences for the long-term fate of important digital assets.

References

Showing 1-10 of 18 references

Crawling the Hidden Web

TLDR
A generic operational model of a hidden Web crawler is introduced and how this model is realized in HiWE (Hidden Web Exposer), a prototype crawler built at Stanford is described.

Crawler-Friendly Web Servers

TLDR
This paper proposes that web servers export meta-data archives describing their content so that there are significant bandwidth savings, and evaluates simple, easy-to-incorporate modifications to web servers.

Mercator: A scalable, extensible Web crawler

TLDR
This paper describes Mercator, a scalable, extensible Web crawler written entirely in Java, and comments on Mercator's performance, which is found to be comparable to that of other crawlers for which performance numbers have been published.

Analyzing Web Robots and Their Impact on Caching

TLDR
The analyses point out that robots cause a significant increase in the miss ratio of a server-side cache, highlighting not only the need for a better understanding of robot behavior but also the need for Web caching policies that treat robots' requests differently than human-generated requests.

Finding replicated Web collections

TLDR
The case for identifying replicated documents and collections to improve web crawlers, archivers, and ranking functions used in search engines is made.

The freshness of web search engine databases

TLDR
It is found that Google performs best overall with the most pages updated on a daily basis, but only MSN is able to update all pages within a time-span of less than 20 days, and both other engines have outliers that are older.

Combating Web Spam with TrustRank

Hierarchical Workload Characterization for a Busy Web Server

TLDR
The behavioural characteristics that emerge from this study show different features at each level of the Web server access hierarchy and suggest effective strategies for managing resources at busy Internet Web servers.

Reconstructing Websites for the Lazy Webmaster

TLDR
The concept of “lazy preservation” -- digital preservation performed as a result of the normal operations of the Web infrastructure (search engines and caches) -- is introduced, and Warrick, a tool to automate the process of website reconstruction from the Internet Archive, Google, MSN, and Yahoo, is presented.

Topic-sensitive PageRank

TLDR
A set of PageRank vectors, biased using a set of representative topics, is proposed to capture more accurately the notion of importance with respect to a particular topic, and is shown to generate more accurate rankings than a single, generic PageRank vector.