The evolution of a crawling strategy for an academic document search engine: whitelists and blacklists

@inproceedings{Wu2012TheEO,
  title={The evolution of a crawling strategy for an academic document search engine: whitelists and blacklists},
  author={Jian Wu and P. Teregowda and Juan Pablo Fern{\'a}ndez Ram{\'i}rez and P. Mitra and Shuyi Zheng and C. Lee Giles},
  booktitle={WebSci '12},
  year={2012}
}
We present a preliminary study of the evolution of a crawling strategy for an academic document search engine, in particular CiteSeerX. CiteSeerX actively crawls the web for academic and research documents, primarily in computer and information sciences, and then performs unique information extraction and indexing, extracting information such as OAI metadata, citations, tables, and others. As such, CiteSeerX could be considered a specialty or vertical search engine. To improve precision in…
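The whitelist/blacklist policy named in the title can be sketched as a simple URL admission filter. This is a minimal illustration, not CiteSeerX's actual implementation; the domain lists and policy below are invented for the example.

```python
from urllib.parse import urlparse

# Hypothetical domain lists; a real crawler would load these from
# curated files and update them as the crawl evolves.
WHITELIST = {"arxiv.org", "citeseerx.ist.psu.edu"}
BLACKLIST = {"example-spam.com"}

def should_crawl(url: str) -> bool:
    """Admit a URL only if its host is whitelisted and not blacklisted."""
    host = urlparse(url).netloc.lower()
    if host in BLACKLIST:
        return False
    return host in WHITELIST
```

A whitelist trades recall for precision: only known-good hosts are fetched, which suits a vertical engine where off-topic pages are costly to process.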
Scalability Bottlenecks of the CiteSeerX Digital Library Search Engine
As the document collection and user population increase, the capability and performance of a digital library such as CiteSeerX may be limited by some bottlenecks. This paper describes the current…
Web crawler middleware for search engine digital libraries: a case study for CiteSeerX
A middleware, the Crawl Document Importer (CDI), is developed that selectively imports documents and the associated metadata into the CiteSeerX crawl repository and database; it is designed to be extensible, providing a universal interface to the crawl database.
Bayes topic prediction model for focused crawling of vertical search engine
  • Weihong Zhang, Yong Chen
  • Computer Science
  • 2014 IEEE Computers, Communications and IT Applications Conference
  • 2014
A new information resource discovery model is proposed, and a crawler for a vertical search engine is built that selectively fetches webpages relevant to a pre-defined topic; experiments show that the average prediction accuracy of the proposed model can exceed 85%.
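A Bayes topic filter of the kind described above can be sketched as a naive Bayes classifier that gates fetched pages by predicted topic. The training data, smoothing, and vocabulary handling below are generic illustration, not the paper's actual model.

```python
import math
from collections import Counter

def train(docs_by_class):
    """docs_by_class: {label: [token lists]} -> naive Bayes model with
    Laplace smoothing. Returns per-class (log prior, log likelihoods,
    log probability for unseen tokens)."""
    total_docs = sum(len(d) for d in docs_by_class.values())
    vocab = {t for docs in docs_by_class.values() for doc in docs for t in doc}
    model = {}
    for label, docs in docs_by_class.items():
        counts = Counter(t for doc in docs for t in doc)
        denom = sum(counts.values()) + len(vocab)
        model[label] = (
            math.log(len(docs) / total_docs),
            {t: math.log((counts[t] + 1) / denom) for t in vocab},
            math.log(1 / denom),  # smoothed fallback for unseen tokens
        )
    return model

def predict(model, tokens):
    """Return the most probable class label for a token list."""
    def score(label):
        prior, likes, unseen = model[label]
        return prior + sum(likes.get(t, unseen) for t in tokens)
    return max(model, key=score)
```

A focused crawler would only enqueue outlinks of pages the classifier labels relevant, concentrating its fetch budget on the target topic.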
Finding seeds to bootstrap focused crawlers
It is shown that seeds can greatly influence the performance of crawlers, and a new framework for automatically finding seeds is proposed; it yields higher harvest rates and improved topic coverage by providing crawlers a seed set that is large and varied.
CiteSeerX: AI in a Digital Library Search Engine
This work presents key AI technologies used in the following CiteSeerX components: document classification and de-duplication, document and citation clustering, automatic metadata extraction and indexing, and author disambiguation.
Scholarly big data information extraction and integration in the CiteSeerχ digital library
This paper describes how CiteSeerχ aggregates data from multiple sources on the Web; stores and manages data; processes data as part of an automatic ingestion pipeline that includes automatic metadata and information extraction; performs document and citation clustering; performs entity linking and name disambiguation; and makes data and source code available to enable research and collaboration.
Modeling Updates of Scholarly Webpages Using Archived Data
The utility of archived data for optimizing the crawling strategy of web crawlers is demonstrated, important challenges that inspire future research directions are uncovered, and an approach is proposed for modeling the dynamics of change on the web using archived copies of webpages.
Term frequency-information content for focused crawling to predict relevant web pages.
By considering terms' information content, this paper proposes the Term Frequency-Information Content (TF-IC) method, which assigns an appropriate weight to each term in a multi-term topic.
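Assuming "information content" here means the standard self-information of a term under a background corpus (a guess at the general idea, not the paper's exact formulation), a term-weighting sketch in that spirit:

```python
import math
from collections import Counter

def tf_ic_weights(topic_terms, background_docs):
    """Sketch of tf * information-content weighting: each topic term is
    weighted by its frequency times -log of its (smoothed) probability
    in a background corpus, so rarer terms weigh more. The exact TF-IC
    formulation in the paper may differ."""
    background = Counter(t for doc in background_docs for t in doc)
    total = sum(background.values())
    tf = Counter(topic_terms)
    return {
        t: tf[t] * -math.log((background.get(t, 0) + 1) / (total + len(background)))
        for t in set(topic_terms)
    }
```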
Searching online book documents and analyzing book citations
This work proposes a hybrid approach for extracting the title and authors from a book that combines results from CiteSeer, a rule-based extractor, and an SVM-based extractor, leveraging web knowledge; it also introduces an open-book search engine that extracts and indexes metadata, contents, and bibliography from online PDF book documents.
Utility-Based Control Feedback in a Digital Library Search Engine: Cases in CiteSeerX
We describe a utility-based feedback control model and its applications within an open access digital library search engine – CiteSeerX, the new version of CiteSeer. CiteSeerX leverages user-based…

References

RankMass Crawler: A Crawler with High PageRank Coverage Guarantee
This paper develops a family of crawling algorithms that provide a theoretical guarantee on how much of the "important" part of the Web a crawler will download after crawling a certain number of pages, and that give a high priority to important pages during a crawl, so that the search engine can index the most important part of the Web first.
RankMass crawler: a crawler with high personalized pagerank coverage guarantee
Graph-based seed selection for web-scale crawlers
This paper proposes a graph-based framework for crawler seed selection and presents several algorithms within this framework that show significant improvements over heuristic seed-selection approaches.
Crawling the Infinite Web
Several probabilistic models for user browsing in "infinite" Web sites are proposed and studied, aimed at predicting how deep users go while exploring Web sites, and validated against real data on page views in several Web sites.
Efficient Crawling Through URL Ordering
This paper studies in what order a crawler should visit the URLs it has seen in order to obtain more "important" pages first, and shows that a crawler with a good ordering scheme can obtain important pages significantly faster than one without.
Web Crawling
The fundamental challenges of web crawling are outlined, the state-of-the-art models and solutions are described, and avenues for future work are highlighted.
Foundations and Trends in Information Retrieval
  • 2010
Crawling the Infinite Web
  • J. Web Eng.
  • 2007