Web crawler middleware for search engine digital libraries: a case study for CiteSeerX

@inproceedings{Wu2012WebCM,
  title={Web crawler middleware for search engine digital libraries: a case study for {CiteSeerX}},
  author={Jian Wu and Pradeep B. Teregowda and Madian Khabsa and Stephen Carman and D. Jordan and J. S. P. Wandelmer and Xin Lu and P. Mitra and C. Lee Giles},
  booktitle={WIDM '12},
  year={2012}
}
  • Jian Wu, Pradeep B. Teregowda, Madian Khabsa, Stephen Carman, D. Jordan, J. S. P. Wandelmer, Xin Lu, P. Mitra, C. Lee Giles
  • Published in WIDM '12 2012
  • Computer Science
  • Middleware is an important part of many search engine web crawling processes. We developed a middleware, the Crawl Document Importer (CDI), which selectively imports documents and the associated metadata to the digital library CiteSeerX crawl repository and database. This middleware is designed to be extensible as it provides a universal interface to the crawl database. It is designed to support input from multiple open source crawlers and archival formats, e.g., ARC, WARC. It can also import…
    Scalability Bottlenecks of the CiteSeerX Digital Library Search Engine
    Study of WEBCRAWLING Polices
    Towards building a collection of web archiving research articles