Web crawler middleware for search engine digital libraries: a case study for citeseerX

@inproceedings{Wu2012WebCM,
  title={Web crawler middleware for search engine digital libraries: a case study for citeseerX},
  author={Jian Wu and P. Teregowda and Madian Khabsa and Stephen Carman and Douglas Jordan and J. S. P. Wandelmer and Xin Lu and P. Mitra and C. Lee Giles},
  booktitle={WIDM '12},
  year={2012}
}
Middleware is an important part of many search engine web crawling processes. We developed a middleware, the Crawl Document Importer (CDI), which selectively imports documents and the associated metadata to the digital library CiteSeerX crawl repository and database. This middleware is designed to be extensible as it provides a universal interface to the crawl database. It is designed to support input from multiple open source crawlers and archival formats, e.g., ARC, WARC. It can also import…
Scalability Bottlenecks of the CiteSeerX Digital Library Search Engine
As the document collection and user population increase, the capability and performance of a digital library such as CiteSeerX may be limited by some bottlenecks. This paper describes the current…
and Information Science, University of North Texas, Denton, TX 76203
The field of Web Archiving exists in a fluid, fragmented, and heterogeneous state. Part of the problem is that this field is relatively new and its literature is scattered across a wide range of…
Study of WEBCRAWLING Polices
TLDR
This paper studies the various issues important for designing a high-performance system; the performance and outcomes are determined by the given factors under the summarization criteria.
A Method for Integrating Bibliographic Data from OAI-PMH Data Providers
TLDR
This paper introduces a method for integrating research articles in PDF format with their corresponding bibliographic metadata extracted from OAI-PMH data providers, and implements a prototype based on wrappers to extract, store, and link research articles in PDF format with their corresponding bibliographic metadata in a database.
Big Scholarly Data in CiteSeerX: Information Extraction from the Web
We examine CiteSeerX, an intelligent system designed with the goal of automatically acquiring and organizing large-scale collections of scholarly documents from the world wide web. From the…
Middleware Technologies for Cloud of Things - a survey
TLDR
The main aim of this paper is to study the middleware technologies for CoT and to present the main features and characteristics of middlewares, which include different architecture styles and service domains, along with a list of current challenges and issues in the design of CoT-based middlewares.
Purposeful Searching for Citations of Scholarly Publications
TLDR
Search strategies will be developed and evaluated in this work in order to reduce the costs for the analysis of documents without citations to the given set of publications.
Towards building a collection of web archiving research articles
TLDR
This presentation discusses building a collection of web archiving research articles and the challenges of doing so in the rapidly changing environment.

References

SeerSuite: Developing a Scalable and Reliable Application Framework for Building Digital Libraries by Crawling the Web
TLDR
The design of SeerSuite is described, along with the deployment and usage of CiteSeerX as an instance of SeerSuite, which enables access to extensive document, citation, and author metadata by automatically extracting, storing, and indexing metadata.
The evolution of a crawling strategy for an academic document search engine: whitelists and blacklists
TLDR
It is found that crawling the whitelist significantly increases the crawl precision by reducing a large amount of irrelevant requests and downloads.
The Definitive Guide to Django: Web Development Done Right, Second Edition
This latest edition of The Definitive Guide to Django is updated for Django 1.1, and, with the forward-compatibility guarantee that Django now provides, should serve as the ultimate tutorial and…
Introduction to Heritrix, an archival quality web crawler
  • Proceedings of the 4th International Web Archiving Workshop (IWAW'04)
  • 2004
The Definitive Guide to Django: Web Development Done Right (Pro)