Web crawler middleware for search engine digital libraries: a case study for CiteSeerX

@inproceedings{Wu2012WebCM,
  title={Web crawler middleware for search engine digital libraries: a case study for CiteSeerX},
  author={Jian Wu and Pradeep B. Teregowda and Madian Khabsa and Stephen Carman and Douglas Jordan and Jose San Pedro Wandelmer and Xin Lu and Prasenjit Mitra and C. Lee Giles},
  booktitle={WIDM '12},
  year={2012}
}
Middleware is an important part of many search engine web crawling processes. We developed a middleware, the Crawl Document Importer (CDI), which selectively imports documents and the associated metadata to the digital library CiteSeerX crawl repository and database. This middleware is designed to be extensible as it provides a universal interface to the crawl database. It is designed to support input from multiple open source crawlers and archival formats, e.g., ARC, WARC. It can also import… 
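The abstract describes a selective import pipeline rather than giving code, but the core step can be sketched. Below is a minimal Python sketch, assuming the open-source warcio library for reading WARC records; stage_document() is a hypothetical placeholder for whatever persists a document and its metadata to the crawl repository and database, and is not part of CDI.

```python
# Sketch of a CDI-style selective import from a WARC file.
# Assumption: warcio (https://github.com/webrecorder/warcio) is installed.
from warcio.archiveiterator import ArchiveIterator

def stage_document(url, fetch_time, body):
    """Hypothetical hook: persist the file and crawl metadata to the repository."""
    print(f"staging {len(body)} bytes from {url} (fetched {fetch_time})")

def import_warc(path):
    with open(path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue  # skip request/metadata records
            headers = record.http_headers
            ctype = headers.get_header("Content-Type", "") if headers else ""
            if "application/pdf" not in ctype:
                continue  # selective import: keep only document types of interest
            url = record.rec_headers.get_header("WARC-Target-URI")
            fetched = record.rec_headers.get_header("WARC-Date")
            stage_document(url, fetched, record.content_stream().read())

if __name__ == "__main__":
    import_warc("crawl-batch.warc.gz")  # example input path
```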

Scalability Bottlenecks of the CiteSeerX Digital Library Search Engine

The current infrastructure of the CiteSeerX academic digital library search engine is described, its current bottlenecks are outlined, and feasible solutions that are under testing or on the roadmap are proposed.

Building an Accessible, Usable, Scalable, and Sustainable Service for Scholarly Big Data

This paper reviews the design, implementation, and operation experiences and lessons of CiteSeerX, a real-world digital library search engine, and proposes a new design with a revised architecture and enhanced hardware and software infrastructure.

Towards building a collection of web archiving research articles

This paper presents a process, grounded in information retrieval and machine learning techniques, for gathering a corpus of literature about an emerging field of Web Archiving, and presents an approach to building a collection of articles about the subject.

Crawling and Mining Social Media Networks: A Facebook Case

An automated system that takes a Facebook user as input, recursively extracts the user's list of friends, and returns each friend's information (name, university, etc.).
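As described, the system is a bounded breadth-first traversal of the friendship graph. A minimal sketch of that traversal, where fetch_friends() is a hypothetical stand-in for the actual profile extraction:

```python
# Breadth-first traversal of a friendship graph, as the description implies.
# fetch_friends() is hypothetical; a real system would extract it from pages or an API.
from collections import deque

def fetch_friends(user_id):
    """Stub: return [(friend_id, {'name': ..., 'university': ...}), ...]."""
    return []  # replace with real extraction

def crawl_friends(seed_user, max_depth=2):
    seen = {seed_user}
    results = {}
    queue = deque([(seed_user, 0)])
    while queue:
        user, depth = queue.popleft()
        if depth >= max_depth:
            continue  # bound the recursion
        for friend, info in fetch_friends(user):
            if friend in seen:
                continue  # a friend may be reachable via several paths
            seen.add(friend)
            results[friend] = info
            queue.append((friend, depth + 1))
    return results
```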

Study of Web Crawling Policies

This paper surveys the various issues important for designing a high-performance crawling system and summarizes the factors that determine such a system's performance and outcomes.
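One policy such studies invariably cover is politeness, i.e., spacing out requests to a single host. A minimal sketch of a per-host delay policy (an illustrative baseline, not taken from the paper):

```python
# Illustrative per-host politeness policy: enforce a minimum delay between
# successive requests to the same host.
import time
from urllib.parse import urlparse

class PolitenessPolicy:
    def __init__(self, min_delay=1.0):
        self.min_delay = min_delay  # seconds between hits to one host
        self.last_hit = {}          # host -> time of the last request

    def wait_for(self, url):
        host = urlparse(url).netloc
        wait = self.last_hit.get(host, 0.0) + self.min_delay - time.monotonic()
        if wait > 0:
            time.sleep(wait)
        self.last_hit[host] = time.monotonic()
```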

A Method for Integrating Bibliographic Data from OAI-PMH Data Providers

This paper introduces a method for integrating research articles in PDF format with their corresponding bibliographic metadata extracted from OAI-PMH data providers, and presents a wrapper-based prototype that extracts, stores, and links research articles in PDF format with their corresponding bibliographic metadata in a database.
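The harvesting half of such a method follows directly from the OAI-PMH protocol; the linking half depends on local storage. A minimal sketch of a ListRecords harvest in Python (the endpoint URL is a placeholder, and resumption tokens are omitted for brevity):

```python
# Sketch of harvesting Dublin Core metadata via OAI-PMH (ListRecords verb).
# Placeholder endpoint; a real harvester would also follow resumptionTokens.
import urllib.request
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"
DC = "{http://purl.org/dc/elements/1.1/}"

def harvest(endpoint):
    url = endpoint + "?verb=ListRecords&metadataPrefix=oai_dc"
    with urllib.request.urlopen(url) as resp:
        root = ET.fromstring(resp.read())
    for record in root.iter(OAI + "record"):
        title = record.find(".//" + DC + "title")
        ident = record.find(".//" + DC + "identifier")  # key for linking to a PDF
        yield (title.text if title is not None else None,
               ident.text if ident is not None else None)

if __name__ == "__main__":
    for title, ident in harvest("https://example.org/oai"):  # placeholder
        print(title, "->", ident)
```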

Big Scholarly Data in CiteSeerX: Information Extraction from the Web

We examine CiteSeerX, an intelligent system designed with the goal of automatically acquiring and organizing large-scale collections of scholarly documents from the world wide web.

Purposeful Searching for Citations of Scholarly Publications

In this work, search strategies are developed and evaluated in order to reduce the cost of analyzing documents that contain no citations to the given set of publications.

Towards building a collection of web archiving research articles

This presentation discusses building a collection of web archiving research articles and the challenges of doing so in the rapidly changing environment.

References

SeerSuite: Developing a Scalable and Reliable Application Framework for Building Digital Libraries by Crawling the Web

The design of SeerSuite is described, and the deployment and usage of CiteSeerX is presented as an instance of SeerSuite, which enables access to extensive document, citation, and author metadata by automatically extracting, storing, and indexing metadata.

The evolution of a crawling strategy for an academic document search engine: whitelists and blacklists

It is found that crawling the whitelist significantly increases the crawl precision by reducing a large amount of irrelevant requests and downloads.
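Mechanically, a whitelist is a host-level filter applied before URLs are scheduled; a blacklist is the same test inverted. A minimal illustrative sketch (not the paper's code):

```python
# Illustrative host-level whitelist filter applied before scheduling a URL.
from urllib.parse import urlparse

def load_hosts(path):
    """Read one host per line, ignoring blanks."""
    with open(path) as f:
        return {line.strip().lower() for line in f if line.strip()}

def allowed(url, whitelist):
    return urlparse(url).netloc.lower() in whitelist

# e.g. allowed("http://repository.example.edu/paper.pdf", {"repository.example.edu"})
```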

The Definitive Guide to Django: Web Development Done Right, Second Edition

This latest edition of The Definitive Guide to Django is updated for Django 1.1 and, with the forward-compatibility guarantee that Django now provides, should serve as the ultimate tutorial and…

Introduction to Heritrix, an archival quality web crawler

  • Proceedings of the 4th International Web Archiving Workshop (IWAW'04)
  • 2004

The Definitive Guide to Django: Web Development Done Right (Pro)
