Sic transit gloria telae: towards an understanding of the web's decay
@inproceedings{BarYossef2004SicTG,
  title     = {Sic transit gloria telae: towards an understanding of the web's decay},
  author    = {Ziv Bar-Yossef and Andrei Z. Broder and Ravi Kumar and Andrew Tomkins},
  booktitle = {WWW '04},
  year      = {2004}
}
The rapid growth of the web has been noted and tracked extensively. However, recent studies have documented the dual phenomenon: web pages have short half-lives, and thus the web exhibits rapid death as well. Consequently, page creators are faced with an increasingly burdensome task of keeping links up-to-date, and many are falling behind. In addition to individual pages, collections of pages or even entire neighborhoods of the web exhibit significant decay, rendering them less effective as…
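The notion of a half-life can be made concrete with a small worked sketch. The following is only an illustration under a constant-decay assumption, with made-up numbers rather than figures from the paper: given the fraction of a page sample still alive after some observation window, exponential decay implies a half-life.

```python
import math

def half_life(days_elapsed: float, survival_fraction: float) -> float:
    """Half-life under exponential decay, given the fraction of pages still
    reachable after `days_elapsed` days. Inputs are hypothetical observations,
    not figures from the paper."""
    decay_rate = -math.log(survival_fraction) / days_elapsed
    return math.log(2) / decay_rate

# Example: if 60% of a sample of pages is still reachable after 90 days,
# the implied half-life is about 122 days.
print(round(half_life(90, 0.60)))  # -> 122
```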
151 Citations
A method for measuring the evolution of a topic on the Web: The case of “informetrics”
- Computer Science
- 2009
A method is developed for tracking topics on the Web over long periods of time without the need to employ a crawler, relying only on publicly available resources.
The web beyond popularity: a really simple system for web scale RSS
- Computer Science, WWW '06
- 2006
The "Daily Deltas" (Delta) application is able to provide an informative feed of relevant content directly to a user, allowing individuals to track their interests independent of the overall popularity of the topic.
A Pocket Guide to Web History
- Computer Science, SPIRE
- 2007
This paper addresses the requirement by proposing rank synopses as a novel structure to compactly represent and reconstruct historical PageRank scores, and devises a normalization scheme for PageRank scores to make them comparable across different graphs.
Using the web infrastructure for real time recovery of missing web pages
- Computer Science
- 2011
A temporal study of the decay of lexical signatures and titles is conducted and their half-life is estimated, and the use of tags that users have created to annotate pages, as well as the most salient terms derived from a page's link neighborhood, is proposed.
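For illustration, a lexical signature is commonly built as the top few TF-IDF terms of a page. The minimal sketch below shows that generic construction, not necessarily the exact formulation evaluated in the cited work; all names and counts are hypothetical.

```python
import math
from collections import Counter

def lexical_signature(doc_tokens, doc_freq, n_docs, k=5):
    """Top-k terms of a page ranked by TF-IDF, a common way to build a
    lexical signature. `doc_freq` maps a term to the number of corpus
    documents containing it; all inputs here are hypothetical."""
    tf = Counter(doc_tokens)

    def score(term):
        return tf[term] * math.log(n_docs / (1 + doc_freq.get(term, 0)))

    return sorted(tf, key=score, reverse=True)[:k]

# Toy example: rare, page-specific terms outrank frequent generic ones.
print(lexical_signature(
    ["web", "decay", "soft", "404", "web", "pages"],
    doc_freq={"web": 900, "pages": 800, "decay": 12, "soft": 40, "404": 25},
    n_docs=1000,
    k=3))  # -> ['decay', '404', 'soft']
```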
Bringing your dead links back to life: a comprehensive approach and lessons learned
- Computer Science, HT '09
- 2009
An algorithm is developed that incorporates a comprehensive set of heuristics and correctly finds new links for more than 70% of broken links at a 95% confidence level, and it is demonstrated empirically that the problem of searching for moved pages differs from typical information retrieval problems.
Vetting the links of the web
- Computer Science, CIKM
- 2009
A general classification model is built, primarily using local and global temporal features extracted from historical content-, topic-, link-, and time-focused changes, which could be useful for various applications that depend on the analysis of web links, including ranking and crawling.
What's really new on the web?: identifying new pages from a series of unstable web snapshots
- Computer Science, WWW '06
- 2006
Using a novelty measure for estimating the certainty that a newly crawled page appeared between the previous and current crawls, new pages can be extracted from a series of unstable snapshots for further analysis and mining to identify new trends on the Web.
Detecting Off-Topic Pages in Web Archives
- Computer Science, TPDL
- 2015
This paper evaluates six different methods for detecting when a page has gone off-topic across subsequent captures in Web archive collections, and finds that combining cosine similarity at a threshold of 0.10 with change in size (measured by word count) at a threshold of -0.85 performs best.
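The two reported thresholds can be read as a simple decision rule. The sketch below is one possible interpretation, assuming term-frequency cosine similarity against the first capture, relative word-count change, and an OR-combination of the two tests; it is not the cited paper's implementation.

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def is_off_topic(first_capture: list[str], later_capture: list[str],
                 cos_threshold: float = 0.10,
                 size_threshold: float = -0.85) -> bool:
    """Flag a later capture as off-topic if it is too dissimilar to the first
    capture, or if its word count shrank too much. Threshold values follow the
    cited paper; the OR-combination and TF weighting are assumptions."""
    sim = cosine(Counter(first_capture), Counter(later_capture))
    size_change = (len(later_capture) - len(first_capture)) / len(first_capture)
    return sim < cos_threshold or size_change < size_threshold
```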
Moved but not gone: an evaluation of real-time methods for discovering replacement web pages
- Computer Science, International Journal on Digital Libraries
- 2014
Analysis of four content- and link-based methods to rediscover missing Web pages indicates that Web pages are often not completely lost but have moved to a different location and “just” need to be rediscovered.
Analyzing the Perceptions of Change in a Distributed Collection of Web Documents
- Computer Science, HT
- 2016
A case study is presented that compares change detection methods based on machine learning algorithms against human assessments of change collected in a user study on the ACM conference list.
References
Showing 1-10 of 34 references
What's new on the web?: the evolution of the web from a search engine perspective
- Computer Science, WWW '04
- 2004
The authors' findings indicate a rapid turnover rate of Web pages, i.e., high rates of birth and death, coupled with an even higher rate of turnover in the hyperlinks that connect them, which is likely to remain consistent over time.
An Analysis of Web Page and Web Site Constancy and Permanence
- Computer Science, J. Am. Soc. Inf. Sci.
- 1999
We recognize that documents on the World Wide Web are ephemeral and changing. We also recognize that Web documents can be categorized along a number of dimensions, including “publisher,” size, object…
A large-scale study of the evolution of web pages
- Computer Science, WWW '03
- 2003
It is found that the average degree of change varies widely across top-level domains, and that larger pages change more often and more severely than smaller ones.
The Anatomy of a Large-Scale Hypertextual Web Search Engine
- Computer Science, Comput. Networks
- 1998
Focused Crawling: A New Approach to Topic-Specific Web Resource Discovery
- Computer Science, Comput. Networks
- 1999
The Connectivity Server: Fast Access to Linkage Information on the Web
- Computer Science, Comput. Networks
- 1998
Methods for Sampling Pages Uniformly from the World Wide Web
- Computer Science
- 2001
Two new algorithms for generating uniformly random samples of pages from the World Wide Web are presented, building upon recent work by Henzinger et al. (2000) and Bar-Yossef et al. (2000) based on a weighted random-walk methodology.
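The general flavor of random-walk-based uniform sampling can be sketched as follows. This is a generic degree-rejection illustration on an undirected link graph, not the specific algorithms of Henzinger et al. or Bar-Yossef et al.: a plain walk visits pages roughly in proportion to their degree, so accepting each walk endpoint with probability inversely proportional to its degree removes that bias.

```python
import random

def near_uniform_sample(graph: dict[str, list[str]], start: str,
                        walk_length: int = 1000) -> str:
    """Run random walks over an undirected link graph (adjacency lists) and
    accept each endpoint with probability min_degree / degree(endpoint), so
    high-degree pages are no longer over-represented. A sketch of the general
    rejection idea only; assumes a connected, non-bipartite graph and a walk
    long enough to mix."""
    min_degree = min(len(neighbors) for neighbors in graph.values())
    while True:
        node = start
        for _ in range(walk_length):
            node = random.choice(graph[node])
        if random.random() < min_degree / len(graph[node]):
            return node
```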
Optimal crawling strategies for web search engines
- Computer Science, WWW '02
- 2002
A two-part scheme based on network flow theory is presented that determines the (nearly) optimal crawling frequencies, as well as the theoretically optimal times to crawl each web page, within an extremely general stochastic framework.
Digital libraries and World Wide Web sites and page persistence
- Computer Science, Inf. Res.
- 1999
The paper examines how Web documents can be efficiently and effectively incorporated into library collections and concludes that the Web is not a digital library, but its component parts can be aggregated and included as parts of digital library collections.