The discoverability of the web
@inproceedings{Dasgupta2007TheDO,
  title={The discoverability of the web},
  author={Anirban Dasgupta and Arpita Ghosh and Ravi Kumar and Christopher Olston and Sandeep Pandey and Andrew Tomkins},
  booktitle={WWW '07},
  year={2007}
}
Previous studies have highlighted the high arrival rate of new content on the web. We study the extent to which this new content can be efficiently discovered by a crawler. Our study has two parts. First, we study the inherent difficulty of the discovery problem using a maximum cover formulation, under an assumption of perfect estimates of likely sources of links to new content. Second, we relax this assumption and study a more realistic setting in which algorithms must use historical statistics to…
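The maximum cover formulation in the abstract lends itself to the classic greedy heuristic, which carries the well-known (1 - 1/e) approximation guarantee. Below is a minimal Python sketch under simplifying assumptions: each candidate source page is mapped to the set of new pages it links to. The `sources` dict and `budget` parameter are illustrative names, not from the paper.

```python
def greedy_max_cover(sources, budget):
    """Greedily pick up to `budget` source pages whose outlinks cover
    the largest set of new pages.

    sources: dict mapping a source page to the set of new pages it links to.
    Returns the chosen sources and the set of new pages covered.
    """
    covered, chosen = set(), []
    remaining = dict(sources)
    for _ in range(budget):
        # Pick the source with the largest marginal gain in coverage.
        best = max(remaining, key=lambda s: len(remaining[s] - covered), default=None)
        if best is None or not remaining[best] - covered:
            break  # no remaining source adds new coverage
        chosen.append(best)
        covered |= remaining.pop(best)
    return chosen, covered

# Toy example: three hypothetical hub pages and the new pages they link to.
outlinks = {"hub/a": {"n1", "n2", "n3"}, "hub/b": {"n3", "n4"}, "hub/c": {"n5"}}
print(greedy_max_cover(outlinks, budget=2))  # picks hub/a first, then a gain-1 hub
```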
73 Citations
Crawl ordering by search impact
- Computer Science, WSDM '08
- 2008
A new impact-driven crawling policy is designed that ensures that the crawler acquires content relevant to "tail topics" that are obscure but of interest to some users, rather than just redundantly accumulating content on popular topics.
LiveRank: How to Refresh Old Crawls
- Computer Science, WAW
- 2014
The results show that building on PageRank can lead to efficient LiveRanks for Web graphs; the quality of a LiveRank is measured by the number of queries necessary to identify a given fraction of the alive pages when following the LiveRank order (a minimal version of this metric is sketched below).
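The evaluation metric described in this entry is easy to make concrete. A minimal Python sketch, assuming a LiveRank is simply a list of pages ordered by the ranking and `is_alive` is a hypothetical oracle standing in for the fetch the metric counts; ground truth is assumed known, as in an offline evaluation:

```python
def queries_to_find_fraction(liverank, is_alive, fraction=0.5):
    """Count fetches (queries) made in LiveRank order until the given
    fraction of all alive pages has been identified.

    A good LiveRank front-loads alive pages, so this count stays small.
    """
    total_alive = sum(1 for page in liverank if is_alive(page))
    target = fraction * total_alive
    if target == 0:
        return 0
    found = 0
    for queries, page in enumerate(liverank, start=1):
        if is_alive(page):
            found += 1
            if found >= target:
                return queries
    return len(liverank)  # only reachable if fraction > 1
```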
Essential Web Pages Are Easy to Find
- Computer Science, WWW
- 2015
In this paper we address the problem of estimating the index size needed by web search engines to answer as many queries as possible by exploiting the marked difference between query and click…
Learning to Discover Domain-Specific Web Content
- Computer Science, WSDM
- 2018
New methods are proposed for efficient domain-specific re-crawling that maximize the yield of new content by learning patterns of high-yield pages, achieving 150% higher coverage than existing state-of-the-art techniques.
A Random Walk Model for Optimization of Search Impact in Web Frontier Ranking
- Computer Science, SIGIR
- 2015
This paper proposes a search-centric solution to the problem of prioritizing the pages in a crawler's frontier for download, one that takes into account the pages' potential impact on user-perceived search quality, along with a link-graph enrichment technique that extends this solution.
Novel approaches to crawling important pages early
- Computer Science, Knowledge and Information Systems
- 2012
This paper proposes a set of crawling algorithms for effective and efficient crawl ordering that prioritize important pages, using the well-known PageRank as the importance metric, and presents a large-scale experiment that examines the effect of each feature on crawl ordering and evaluates the performance of many algorithms.
A First Study on Temporal Dynamics of Topics on the Web
- Computer Science, WWW
- 2016
This paper describes preliminary efforts in building a testbed to better understand the dynamics of specific topics and characterize how they evolve over time; the results suggest that topic-specific refreshing strategies can be beneficial for focused crawlers.
Topical Discovery of Web Content
- Computer Science, ArXiv
- 2015
This work describes the theory and the implementation of a new software tool, the "Web Topical Discovery System" (WTDS), which provides an approach to the automatic discovery and selection of new web…
Learning to Schedule Web Page Updates Using Genetic Programming
- Computer Science
A flexible framework that uses Genetic Programming to evolve score functions to estimate the likelihood that a web page has been modified is proposed and a thorough experimental evaluation of the benefits of using the framework over five state-of-the-art baselines is presented.
Measuring the Search Effectiveness of a Breadth-First Crawl
- Computer Science, ECIR
- 2009
Having observed that NDCG@100 (measured over a set of reference queries) begins to plateau in the initial stages of the crawl, the authors investigate a number of possible reasons, including the web pages themselves, the metric used to measure retrieval effectiveness, and the set of relevance judgements used (the metric itself is sketched below).
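For reference, NDCG@k compares the discounted cumulative gain of a ranking against that of the ideal reordering. A minimal sketch of the standard linear-gain variant in Python (graded relevance labels per ranked result are assumed, as in typical IR evaluation; the paper itself may use a different gain function):

```python
import math

def dcg_at_k(rels, k):
    """Discounted cumulative gain over the top-k relevance labels
    (gain = rel, discount = log2(rank + 1))."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(rels[:k]))

def ndcg_at_k(rels, k=100):
    """NDCG@k: DCG of the given ranking divided by DCG of the ideal ranking."""
    ideal = dcg_at_k(sorted(rels, reverse=True), k)
    return dcg_at_k(rels, k) / ideal if ideal > 0 else 0.0
```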
References
Showing 1-10 of 32 references
Rate of Change and other Metrics: a Live Study of the World Wide Web
- Computer Science, USENIX Symposium on Internet Technologies and Systems
- 1997
The potential benefit of a shared proxy-caching server in a large environment is quantified by using traces that were collected at the Internet connection points for two large corporations, representing significant numbers of references.
What's new on the web?: the evolution of the web from a search engine perspective
- Computer Science, WWW '04
- 2004
The authors' findings indicate a rapid turnover rate of Web pages, i.e., high rates of birth and death, coupled with an even higher rate of turnover in the hyperlinks that connect them, which is likely to remain consistent over time.
Optimal crawling strategies for web search engines
- Computer Science, WWW '02
- 2002
A two-part scheme based on network flow theory is presented that determines the (nearly) optimal crawling frequencies, as well as the theoretically optimal times to crawl each web page, within an extremely general stochastic framework.
Ranking the web frontier
- Computer Science, WWW '04
- 2004
This paper analyzes features of the rapidly growing "frontier" of the web, namely the part of theweb that crawlers are unable to cover for one reason or another, and suggests ways to improve the quality of ranking by modeling the growing presence of "link rot" on the web as more sites and pages fall out of maintenance.
The Evolution of the Web and Implications for an Incremental Crawler
- Computer Science, VLDB
- 2000
An architecture for an incremental crawler is proposed that combines the best design choices, significantly improving the "freshness" of the collection and bringing in new pages in a more timely manner.
On the evolution of clusters of near-duplicate Web pages
- Computer Science, First Latin American Web Congress (LA-WEB 2003)
- 2003
A 1997 study of the amount and distribution of near-duplicate pages on the World Wide Web is expanded, and it is found that 29.2% of all Web pages are very similar to other pages, and that 22…
User-centric Web crawling
- Computer Science, WWW '05
- 2005
The results demonstrate that the user-centric method requires far fewer resources to maintain the same search-engine quality level for users, leaving substantially more resources available for incorporating new Web pages into the search repository.
An adaptive model for optimizing performance of an incremental web crawler
- Computer Science, WWW '01
- 2001
This paper outlines the design of a web crawler implemented for IBM Almaden's WebFountain project, describes an optimization model for controlling the crawl strategy, and shows that there are compromise objectives which lead to good strategies that are robust against a number of criteria.