The discoverability of the web

@inproceedings{Dasgupta2007TheDO,
  title={The discoverability of the web},
  author={Anirban Dasgupta and Arpita Ghosh and Ravi Kumar and Christopher Olston and Sandeep Pandey and Andrew Tomkins},
  booktitle={WWW '07},
  year={2007}
}
Previous studies have highlighted the high arrival rate of new content on the web. We study the extent to which this new content can be efficiently discovered by a crawler. Our study has two parts. First, we study the inherent difficulty of the discovery problem using a maximum cover formulation, under an assumption of perfect estimates of likely sources of links to new content. Second, we relax this assumption and study a more realistic setting in which algorithms must use historical statistics to…
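The abstract's maximum cover formulation is not spelled out on this page, but the standard greedy approximation conveys the idea: with a budget of k known pages to recrawl, repeatedly pick the page whose outlinks would reveal the largest number of not-yet-discovered new pages. The sketch below is a minimal illustration under that reading; the outlinks mapping, the budget parameter, and the function name are assumptions made for the example, not details from the paper.

def greedy_discovery_cover(outlinks, budget):
    # Greedy max-cover sketch: choose `budget` known pages to recrawl so that
    # the union of new pages reachable through their outlinks is (approximately)
    # maximized. `outlinks` maps a known page to the set of new pages it links
    # to; the data is illustrative, not taken from the paper.
    covered, chosen = set(), []
    candidates = dict(outlinks)
    for _ in range(budget):
        # Pick the page contributing the most not-yet-covered new pages.
        best = max(candidates, key=lambda p: len(candidates[p] - covered), default=None)
        if best is None or not (candidates[best] - covered):
            break  # no remaining candidate adds new coverage
        chosen.append(best)
        covered |= candidates.pop(best)
    return chosen, covered

# Toy usage: three known pages linking to overlapping sets of new URLs.
links = {
    "hub.html": {"a", "b", "c"},
    "index.html": {"c", "d"},
    "blog.html": {"d", "e", "f"},
}
print(greedy_discovery_cover(links, budget=2))
# -> (['hub.html', 'blog.html'], {'a', 'b', 'c', 'd', 'e', 'f'})

This greedy rule carries the usual (1 - 1/e) approximation guarantee for maximum coverage, which is why it is a natural baseline for budgeted discovery problems of this kind.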
Crawl ordering by search impact
TLDR
A new impact-driven crawling policy is designed that ensures that the crawler acquires content relevant to "tail topics" that are obscure but of interest to some users, rather than just redundantly accumulating content on popular topics.
LiveRank: How to Refresh Old Crawls
TLDR
The results show that building on PageRank can lead to efficient LiveRanks for Web graphs, and the quality of a LiveRank is measured by the number of queries necessary to identify a given fraction of the alive pages when using the LiveRank order.
Essential Web Pages Are Easy to Find
In this paper we address the problem of estimating the index size needed by web search engines to answer as many queries as possible by exploiting the marked difference between query and click…
Learning to Discover Domain-Specific Web Content
TLDR
New methods for efficient domain-specific re-crawling are proposed that maximize the yield of new content by learning patterns of pages with high yield; these methods can achieve 150% higher coverage compared to existing, state-of-the-art techniques.
A Random Walk Model for Optimization of Search Impact in Web Frontier Ranking
TLDR
This paper proposes a search-centric solution to the problem of prioritizing the pages in the frontier of a crawler for download that takes into account the pages' potential impact on user-perceived search quality, and proposes a link graph enrichment technique that extends this solution.
Novel approaches to crawling important pages early
TLDR
This paper proposes a set of crawling algorithms for effective and efficient crawl ordering that prioritize important pages, with the well-known PageRank as the importance metric, and conducts a large-scale experiment to examine the effect of each feature on crawl ordering and to evaluate the performance of many algorithms.
A First Study on Temporal Dynamics of Topics on the Web
TLDR
The preliminary efforts in building a testbed to better understand the dynamics of specific topics and characterize how they evolve over time and the results suggest that topic-specific refreshing strategies can be beneficial for focused crawlers.
Topical Discovery of Web Content
This work describes the theory and the implementation of a new software tool, the "Web Topical Discovery System" (WTDS), which provides an approach to the automatic discovery and selection of new web…
LEARNING TO SCHEDULE WEB PAGE UPDATES USING GENETIC PROGRAMMING
TLDR
A flexible framework that uses Genetic Programming to evolve score functions to estimate the likelihood that a web page has been modified is proposed and a thorough experimental evaluation of the benefits of using the framework over five state-of-the-art baselines is presented.
Measuring the Search Effectiveness of a Breadth-First Crawl
TLDR
Having observed that NDCG@100 (measured over a set of reference queries) begins to plateau in the initial stages of the crawl, a number of possible reasons are investigated, including the web pages themselves, the metric used to measure retrieval effectiveness, and the set of relevance judgements used.

References

SHOWING 1-10 OF 32 REFERENCES
Rate of Change and other Metrics: a Live Study of the World Wide Web
TLDR
The potential benefit of a shared proxy-caching server in a large environment is quantified by using traces that were collected at the Internet connection points for two large corporations, representing significant numbers of references.
What's new on the web?: the evolution of the web from a search engine perspective
TLDR
The authors' findings indicate a rapid turnover rate of Web pages, i.e., high rates of birth and death, coupled with an even higher rate of turnover in the hyperlinks that connect them, which is likely to remain consistent over time.
Optimal crawling strategies for web search engines
TLDR
A two-part scheme based on network flow theory is presented that determines the (nearly) optimal crawling frequencies, as well as the theoretically optimal times to crawl each web page, within an extremely general stochastic framework.
Ranking the web frontier
TLDR
This paper analyzes features of the rapidly growing "frontier" of the web, namely the part of theweb that crawlers are unable to cover for one reason or another, and suggests ways to improve the quality of ranking by modeling the growing presence of "link rot" on the web as more sites and pages fall out of maintenance.
The Evolution of the Web and Implications for an Incremental Crawler
TLDR
An architecture for an incremental crawler is proposed that combines the best design choices; it can improve the "freshness" of the collection significantly and bring in new pages in a more timely manner.
On the evolution of clusters of near-duplicate Web pages
  • Dennis Fetterly, M. Manasse, Marc Najork
  • Proceedings of the First Latin American Web Congress (LA-WEB 2003)
  • 2003
TLDR
A 1997 study of the amount and distribution of near-duplicate pages on the World Wide Web is expanded, and it is found that 29.2% of all Web pages are very similar to other pages, and that 22.…
User-centric Web crawling
TLDR
The results demonstrate that the user-centric method requires far fewer resources to maintain the same search engine quality level for users, leaving substantially more resources available for incorporating new Web pages into the search repository.
An adaptive model for optimizing performance of an incremental web crawler
TLDR
This paper outlines the design of a web crawler implemented for IBM Almaden's WebFountain project and describes an optimization model for controlling the crawl strategy and shows that there are compromise objectives which lead to good strategies that are robust against a number of criteria.
Trawling the Web for Emerging Cyber-Communities
Efficient Crawling Through URL Ordering