The evolution of a crawling strategy for an academic document search engine: whitelists and blacklists

@inproceedings{Wu2012TheEO,
  title={The evolution of a crawling strategy for an academic document search engine: whitelists and blacklists},
  author={Jian Wu and Pradeep B. Teregowda and Juan Pablo Fern{\'a}ndez Ram{\'i}rez and Prasenjit Mitra and Shuyi Zheng and C. Lee Giles},
  booktitle={WebSci '12},
  year={2012}
}
We present a preliminary study of the evolution of a crawling strategy for an academic document search engine, in particular CiteSeerX. CiteSeerX actively crawls the web for academic and research documents, primarily in computer and information sciences, and then performs unique information extraction and indexing, extracting information such as OAI metadata, citations, tables, and others. As such, CiteSeerX could be considered a specialty or vertical search engine. To improve precision in…

An Approach for Focused Crawler to Harvest Digital Academic Documents in Online Digital Libraries

With the rapid growth of digital information and user needs, £1.5bn-worth of assets are expected to be created within the next 12 months.

Scalability Bottlenecks of the CiteSeerX Digital Library Search Engine

The current infrastructure of the CiteSeerX academic digital library search engine is described, its current bottlenecks are outlined, and feasible solutions, either under testing or on the roadmap, are proposed.

CiteSeerX: AI in a Digital Library Search Engine

This work presents key AI technologies used in the following components of CiteSeerX: document classification and deduplication, document and citation clustering, automatic metadata extraction and indexing, and author disambiguation.

Web crawler middleware for search engine digital libraries: a case study for CiteSeerX

A middleware, the Crawl Document Importer (CDI), is developed to selectively import documents and their associated metadata into the CiteSeerX crawl repository and database. It is designed to be extensible, as it provides a universal interface to the crawl database.

Bayes topic prediction model for focused crawling of vertical search engine

  • Weihong Zhang, Yong Chen
  • Computer Science
    2014 IEEE Computers, Communications and IT Applications Conference
  • 2014
A new information resource discovery model is proposed, and a crawler for a vertical search engine is built that can selectively fetch webpages relevant to a pre-defined topic. Results show that the average prediction accuracy of the proposed model can reach more than 85%.

Finding seeds to bootstrap focused crawlers

It is shown that the seeds can greatly influence the performance of crawlers, and a new framework for automatically finding seeds is proposed, which results in higher harvest rates and an improved topic coverage by providing crawlers a seed set that is large and varied.

Scholarly big data information extraction and integration in the CiteSeerχ digital library

This paper describes how CiteSeerχ aggregates data from multiple sources on the Web; stores and manages data; processes data as part of an automatic ingestion pipeline that includes automatic metadata and information extraction; performs document and citation clustering; performs entity linking and name disambiguation; and makes data and source code available to enable research and collaboration.

Modeling Updates of Scholarly Webpages Using Archived Data

The utility of archived data to optimize the crawling strategy of web crawlers is demonstrated, important challenges that inspire future research directions are uncovered, and an approach for modeling the dynamics of change in the web using archived copies of webpages is proposed.

Searching online book documents and analyzing book citations

This work proposes a hybrid approach for extracting the title and authors from a book that combines results from CiteSeer, a rule-based extractor, and an SVM-based extractor, leveraging web knowledge, and introduces an open-book search engine that extracts and indexes metadata, contents, and bibliography from online PDF book documents.

Utility-Based Control Feedback in a Digital Library Search Engine: Cases in CiteSeerX

We describe a utility-based feedback control model and its applications within an open access digital library search engine – CiteSeerX, the new version of CiteSeer. CiteSeerX leverages user-based

References

RankMass Crawler: A Crawler with High PageRank Coverage Guarantee

This paper develops a family of crawling algorithms that provide a theoretical guarantee on how much of the “important” part of the Web they will download after crawling a certain number of pages and that give a high priority to important pages during a crawl, so that the search engine can index the most important part of the Web first.

RankMass crawler: a crawler with high personalized pagerank coverage guarantee

This paper develops a family of crawling algorithms that provide a theoretical guarantee on how much of the "important" part of the Web they will download after crawling a certain number of pages and that give a high priority to important pages during a crawl, so that the search engine can index the most important part of the Web first.

Graph-based seed selection for web-scale crawlers

This paper proposes a graph-based framework for crawler seed selection, and presents several algorithms within this framework that showed significant improvements over heuristic seed selection approaches.

Crawling the Infinite Web

Several probabilistic models for user browsing in "infinite" Web sites are proposed and studied, aimed at predicting how deep users go while exploring Web sites, and validated against real data on page views in several Web sites.

Efficient Crawling Through URL Ordering

Web Crawling

The fundamental challenges of web crawling are outlined and the state-of-the-art models and solutions are described, and avenues for future work are highlighted.

Crawling the infinite web

  • J. Web Eng.
  • 2007

Foundations and Trends in Information Retrieval

  • Foundations and Trends in Information Retrieval
  • 2010