The evolution of a crawling strategy for an academic document search engine: whitelists and blacklists

@inproceedings{Wu2012TheEO,
  title={The evolution of a crawling strategy for an academic document search engine: whitelists and blacklists},
  author={Jian Wu and Pradeep B. Teregowda and J. Ram{\'i}rez and P. Mitra and Shuyi Zheng and C. Lee Giles},
  booktitle={WebSci '12},
  year={2012}
}
  • Jian Wu, Pradeep B. Teregowda, J. Ramírez, P. Mitra, Shuyi Zheng, C. Lee Giles
  • Published in WebSci '12, 2012
  • Computer Science
  • We present a preliminary study of the evolution of a crawling strategy for an academic document search engine, in particular CiteSeerX. CiteSeerX actively crawls the web for academic and research documents primarily in computer and information sciences, and then performs unique information extraction and indexing extracting information such as OAI metadata, citations, tables and others. As such CiteSeerX could be considered a specialty or vertical search engine. To improve precision in…
    CiteSeerX: AI in a Digital Library Search Engine
    • 62
    Finding seeds to bootstrap focused crawlers
    • 14
    Scalability Bottlenecks of the CiteSeerX Digital Library Search Engine
    • 1
    Searching online book documents and analyzing book citations
    • 11
    The Quest for Research Information
    • 6
    Term frequency-information content for focused crawling to predict relevant web pages.
    • 1