Sic transit gloria telae: towards an understanding of the web's decay

@inproceedings{BarYossef2004SicTG,
  title={Sic transit gloria telae: towards an understanding of the web's decay},
  author={Ziv Bar-Yossef and A. Broder and Ravi Kumar and A. Tomkins},
  booktitle={WWW '04},
  year={2004}
}
The rapid growth of the web has been noted and tracked extensively. Recent studies have, however, documented the dual phenomenon: web pages have small half-lives, and thus the web exhibits rapid death as well. Consequently, page creators face an increasingly burdensome task of keeping links up to date, and many are falling behind. In addition to individual pages, collections of pages or even entire neighborhoods of the web exhibit significant decay, rendering them less effective as…
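The truncated abstract does not say how decay is measured, but the raw phenomenon it describes, out-links that no longer resolve, can be illustrated with a short sketch. This is a generic illustration of link rot, not the decay measure the paper itself develops; the function names and the use of HTTP error status as a liveness proxy are assumptions made for the example.

```python
# Hypothetical sketch: estimate link rot on a single page as the fraction of
# its out-links that no longer resolve. Treating any network failure or HTTP
# status >= 400 as "dead" is a crude proxy, used here only for illustration.
import urllib.request
import urllib.error

def is_dead(url, timeout=5):
    try:
        req = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status >= 400
    except (urllib.error.URLError, ValueError, OSError):
        return True  # unreachable, malformed, or erroring URLs count as dead

def link_rot(outlinks):
    """Fraction of a page's out-links that appear dead (0.0 if it has none)."""
    if not outlinks:
        return 0.0
    return sum(is_dead(u) for u in outlinks) / len(outlinks)
```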
Citations

A method for measuring the evolution of a topic on the Web: The case of “informetrics”
A method is developed for tracking topics on the Web over long periods of time, relying only on publicly available resources and without the need to employ a crawler.
The web beyond popularity: a really simple system for web scale RSS
The "Daily Deltas" (Delta) application is able to provide an informative feed of relevant content directly to a user, allowing individuals to track their interests independent of the overall popularity of the topic.
A Pocket Guide to Web History
This paper proposes rank synopses as a novel structure to compactly represent and reconstruct historical PageRank scores, and devises a normalization scheme for PageRank scores to make them comparable across different graphs.
Using the web infrastructure for real time recovery of missing web pages
A temporal study of the decay of lexical signatures and titles is conducted to estimate their half-life, and the use of tags that users have created to annotate pages, as well as the most salient terms derived from a page's link neighborhood, is proposed.
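The summary above mentions lexical signatures and salient terms as handles for re-finding lost pages. Below is a minimal sketch of one simple construction, the k most frequent content words of a page; the stopword list, the cutoff k, and the use of raw term frequency (rather than, say, TF-IDF over a corpus) are illustrative assumptions, not the method evaluated in that paper.

```python
# Toy lexical signature: the k most frequent content words of a page's text.
# Production systems usually weight terms by TF-IDF; this sketch uses raw
# frequency and a tiny stopword list purely for illustration.
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "for", "on"}

def lexical_signature(text, k=5):
    terms = re.findall(r"[a-z]+", text.lower())
    counts = Counter(t for t in terms if t not in STOPWORDS and len(t) > 2)
    return [term for term, _ in counts.most_common(k)]

# Such a signature can then be issued as a search query to try to re-find the page.
sig = lexical_signature("The web decays: pages die, links rot, and neighborhoods fade away.")
```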
Bringing your dead links back to life: a comprehensive approach and lessons learned
An algorithm is developed that incorporates a comprehensive set of heuristics and correctly finds new links for more than 70% of broken links at a 95% confidence level, and it is demonstrated empirically that the problem of searching for moved pages differs from typical information retrieval problems.
Vetting the links of the web
A general classification model is built, primarily using local and global temporal features extracted from historical changes in content, topic, and links over time, that could be useful for various applications that depend on analysis of web links, including ranking and crawling.
What's really new on the web?: identifying new pages from a series of unstable web snapshots
Using a novelty measure for estimating the certainty that a newly crawled page appeared between the previous and current crawls, new pages can be extracted from a series of unstable snapshots for further analysis and mining to identify new trends on the Web.
Detecting Off-Topic Pages in Web Archives
This paper evaluates six different methods to detect when a page has gone off-topic through subsequent captures in Web archive collections, and finds that combining cosine similarity at threshold 0.10 with change in size (measured by word count) at threshold −0.85 performs best.
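One plausible reading of the winning combination above is sketched below: a capture is flagged off-topic when its cosine similarity to a reference capture falls below 0.10 or its word count shrinks by more than 85% relative to that reference. The choice of reference capture and the OR-combination of the two signals are assumptions made for the example, not details taken from the paper.

```python
# Hedged sketch: combine a cosine-similarity threshold (0.10) with a relative
# word-count change threshold (-0.85) to flag an archived capture as off-topic.
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    va, vb = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(va[t] * vb[t] for t in set(va) & set(vb))
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

def is_off_topic(reference_text, capture_text, sim_threshold=0.10, size_threshold=-0.85):
    ref_words, cap_words = len(reference_text.split()), len(capture_text.split())
    size_change = (cap_words - ref_words) / ref_words if ref_words else 0.0
    return (cosine_similarity(reference_text, capture_text) < sim_threshold
            or size_change < size_threshold)
```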
Moved but not gone: an evaluation of real-time methods for discovering replacement web pages
Analysis of four content- and link-based methods to rediscover missing Web pages indicates that Web pages are often not completely lost but have moved to a different location and “just” need to be rediscovered.
Analyzing the Perceptions of Change in a Distributed Collection of Web Documents
A case study is presented that compares change-detection methods based on machine learning algorithms against the assessment of change made by human subjects in a user study of the ACM conference list.

References

Showing 1–10 of 36 references
What's new on the web?: the evolution of the web from a search engine perspective
The authors' findings indicate a rapid turnover rate of Web pages, i.e., high rates of birth and death, coupled with an even higher rate of turnover in the hyperlinks that connect them, which is likely to remain consistent over time.
A large-scale study of the evolution of web pages
It is found that the average degree of change varies widely across top-level domains, and that larger pages change more often and more severely than smaller ones.
The Anatomy of a Large-Scale Hypertextual Web Search Engine
This paper provides an in-depth description of Google, a prototype of a large-scale search engine which makes heavy use of the structure present in hypertext, and looks at the problem of how to effectively deal with uncontrolled hypertext collections where anyone can publish anything they want.
Focused Crawling: A New Approach to Topic-Specific Web Resource Discovery
A new hypertext resource discovery system called a Focused Crawler is described that is robust against large perturbations in the starting set of URLs, and is capable of exploring out and discovering valuable resources that are dozens of links away from the start set, while carefully pruning the millions of pages that may lie within this same radius.
How dynamic is the Web?
Using empirical models and a novel analytic metric of "up-to-dateness", the rate at which Web search engines must re-index the Web to remain current is estimated.
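The summary does not give the metric, but the flavor of such an estimate can be shown under a standard (assumed) Poisson change model: if a page changes at rate λ per day, an indexed copy is still current after t days with probability e^(−λt), so keeping that probability above a target p requires re-crawling at least every ln(1/p)/λ days. The model and the numbers below are illustrative, not taken from the paper.

```python
# Illustrative back-of-the-envelope estimate under an assumed Poisson change model:
# P(indexed copy still current after t days) = exp(-rate * t), so freshness >= p
# requires re-crawling at least every ln(1/p)/rate days.
import math

def max_recrawl_interval(change_rate_per_day, target_freshness):
    return math.log(1.0 / target_freshness) / change_rate_per_day

# A page that changes on average once every 10 days, kept 90% "up to date":
interval = max_recrawl_interval(0.1, 0.9)  # ~1.05 days between crawls
```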
The Connectivity Server: Fast Access to Linkage Information on the Web
A server is described that provides linkage information for all pages indexed by the AltaVista search engine and can produce the entire neighbourhood of a given set of pages up to a given distance; numerous other applications such as ranking, visualization, and classification are envisaged.
Methods for Sampling Pages Uniformly from the World Wide Web
Two new algorithms for generating uniformly random samples of pages from the World Wide Web are presented, building upon recent work by Henzinger et al. (2000) and Bar-Yossef et al. (2000), based on a weighted random-walk methodology.
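As a generic illustration of the random-walk idea (not the specific algorithms of Henzinger et al. or Bar-Yossef et al.), the sketch below runs a Metropolis-Hastings walk on an undirected graph; accepting a move from u to a random neighbor v with probability min(1, deg(u)/deg(v)) makes the walk's stationary distribution uniform over nodes rather than proportional to degree.

```python
# Metropolis-Hastings random walk whose stationary distribution is uniform over
# the nodes of a connected, undirected graph (a textbook de-biasing technique,
# not necessarily the algorithm used in the paper).
import random

def mh_uniform_walk(graph, start, steps, burn_in=1000):
    """graph: dict mapping each node to a non-empty list of neighbors."""
    node, samples = start, []
    for i in range(steps):
        neighbor = random.choice(graph[node])
        # Accept the move with probability min(1, deg(u)/deg(v)).
        if random.random() < min(1.0, len(graph[node]) / len(graph[neighbor])):
            node = neighbor
        if i >= burn_in:
            samples.append(node)
    return samples
```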
Optimal crawling strategies for web search engines
A two-part scheme based on network flow theory is presented that determines the (nearly) optimal crawling frequencies, as well as the theoretically optimal times to crawl each web page, within an extremely general stochastic framework.
Digital libraries and World Wide Web sites and page persistence
The paper examines how Web documents can be efficiently and effectively incorporated into library collections and concludes that the Web is not a digital library, but its component parts can be aggregated and included as parts of digital library collections.
Using PageRank to Characterize Web Structure
It is suggested that PageRank values on the web follow a power law, and generative models for the web graph are developed that explain this observation and moreover remain faithful to previously studied degree distributions.
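For context, the PageRank values whose distribution is characterized above come from the standard power-iteration computation sketched below; the damping factor 0.85 and the fixed iteration count are conventional defaults, not values taken from the paper.

```python
# Standard PageRank power iteration over an out-link adjacency map
# (graph: node -> list of out-links; all targets must appear as keys).
def pagerank(graph, damping=0.85, iterations=50):
    nodes = list(graph)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        new_rank = {v: (1.0 - damping) / n for v in nodes}
        for u in nodes:
            out = graph[u]
            if out:
                share = damping * rank[u] / len(out)
                for v in out:
                    new_rank[v] += share
            else:  # dangling node: spread its mass uniformly
                for v in nodes:
                    new_rank[v] += damping * rank[u] / n
        rank = new_rank
    return rank
```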