Sic transit gloria telae: towards an understanding of the web's decay

Ziv Bar-Yossef, Andrei Z. Broder, Ravi Kumar, Andrew Tomkins
WWW '04
The rapid growth of the web has been noted and tracked extensively. Recent studies have however documented the dual phenomenon: web pages have small half lives, and thus the web exhibits rapid death as well. Consequently, page creators are faced with an increasingly burdensome task of keeping links up-to-date, and many are falling behind. In addition to just individual pages, collections of pages or even entire neighborhoods of the web exhibit significant decay, rendering them less effective as… 

A method for measuring the evolution of a topic on the Web: The case of “informetrics”
A method for tracking topics on the Web for long periods of time, without the need to employ a crawler and relying only on publicly available resources is developed.
The web beyond popularity: a really simple system for web scale RSS
The "Daily Deltas" (Delta) application is able to provide an informative feed of relevant content directly to a user, allowing individuals to track their interests independent of the overall popularity of the topic.
A Pocket Guide to Web History
This paper addresses the requirement, proposes rank synopses as a novel structure to compactly represent and reconstruct historical PageRank scores, and devises a normalization scheme for PageRank scores to make them comparable across different graphs.
Using the web infrastructure for real time recovery of missing web pages
A temporal study of the decay of lexical signatures and titles is conducted to estimate their half-life, and the use of tags that users have created to annotate pages, as well as the most salient terms derived from a page's link neighborhood, is proposed.
Bringing your dead links back to life: a comprehensive approach and lessons learned
An algorithm is developed that incorporates a comprehensive set of heuristics and succeeds in correctly finding new links for more than 70% of broken links at a 95% confidence level, and it is demonstrated empirically that the problem of searching for moved pages is different from typical information retrieval problems.
Vetting the links of the web
A general classification model is built, primarily using local and global temporal features extracted from historical content, topic, link and time-focused changes over time, that could be useful for various applications that depend on analysis of web links, including ranking and crawling.
What's really new on the web?: identifying new pages from a series of unstable web snapshots
Using a novelty measure for estimating the certainty that a newly crawled page appeared between the previous and current crawls, new pages can be extracted from a series of unstable snapshots for further analysis and mining to identify new trends on the Web.
Detecting Off-Topic Pages in Web Archives
This paper evaluates six methods for detecting when a page has gone off-topic across subsequent captures in Web archive collections, and finds that combining cosine similarity at threshold 0.10 with change in size, measured by word count, at threshold -0.85 performs best.
Moved but not gone: an evaluation of real-time methods for discovering replacement web pages
Analysis of four content- and link-based methods to rediscover missing Web pages indicates that Web pages are often not completely lost but have moved to a different location and “just” need to be rediscovered.
Analyzing the Perceptions of Change in a Distributed Collection of Web Documents
A case study is presented that compares change detection methods based on machine learning algorithms against the assessments made by human subjects in a user study of change in the ACM conference list.


What's new on the web?: the evolution of the web from a search engine perspective
The authors' findings indicate a rapid turnover rate of Web pages, i.e., high rates of birth and death, coupled with an even higher rate of turnover in the hyperlinks that connect them, which is likely to remain consistent over time.
An Analysis of Web Page and Web Site Constancy and Permanence
W. Koehler. J. Am. Soc. Inf. Sci., 1999.
We recognize that documents on the World Wide Web are ephemeral and changing. We also recognize that Web documents can be categorized along a number of dimensions, including “publisher,” size, object
A large-scale study of the evolution of web pages
It is found that the average degree of change varies widely across top-level domains, and that larger pages change more often and more severely than smaller ones.
How dynamic is the Web?
Methods for Sampling Pages Uniformly from the World Wide Web
Two new algorithms for generating uniformly random samples of pages from the World Wide Web are presented, building upon recent work by Henzinger et al. (2000) and Bar-Yossef et al. (2000), based on a weighted random-walk methodology.
Optimal crawling strategies for web search engines
A two-part scheme, based on network flow theory, that determines the (nearly) optimal crawling frequencies, as well as the theoretically optimal times to crawl each web page, within an extremely general stochastic framework.
Digital libraries and World Wide Web sites and page persistence
The paper examines how Web documents can be efficiently and effectively incorporated into library collections and concludes that the Web is not a digital library, but its component parts can be aggregated and included as parts of digital library collections.