CiteSeerx: an architecture and web service design for an academic document search engine

Huajing Li, Isaac G. Councill, Wang-Chien Lee, C. Lee Giles
Published in WWW '06
CiteSeer is a scientific literature digital library and search engine that automatically crawls and indexes scientific documents in the field of computer and information science. After serving as a public search engine for nearly ten years, CiteSeer is starting to have scaling problems in handling more documents, adding new features, and supporting more users. Its monolithic architecture prevents it from effectively making use of new web technologies and providing new services. After analyzing…

SeerSuite: Developing a Scalable and Reliable Application Framework for Building Digital Libraries by Crawling the Web
The design of SeerSuite is described, along with the deployment and usage of CiteSeerx as an instance of SeerSuite, which enables access to extensive document, citation, and author metadata by automatically extracting, storing, and indexing metadata.
Computational Issues in Digital Library Search Engines
Details are presented of a scalable and portable system that is built using a message-oriented middleware architecture with a publish/subscribe approach and can be deployed across different physical and cloud infrastructures.
An entity profile schema for data integration in an academic metasearch engine
This work proposes an approach to the entity profile integration problem based on the selection of profiles with shared attributes and shows that the proposed approach enhances the performance of the metasearch engine.
Efficient Exploration of Algorithms in Scholarly Documents Using Big Data Analytics
This paper proposes a method for developing an algorithm search engine that analyzes a document to discover any algorithms it contains and extracts additional information about each algorithm.
AlgorithmSeer: A System for Extracting and Searching for Algorithms in Scholarly Big Data
A novel set of scalable techniques used by AlgorithmSeer to identify and extract algorithm representations in a heterogeneous pool of scholarly documents is proposed, along with hybrid machine learning approaches to discover algorithm representations.
A Novel Parallel Architecture Design of Information Retrieval System for Scientific Papers
This work proposes SciPDFindexer, a distributed information retrieval system for scientific articles in PDF, which parses and extracts metadata from articles and then indexes the extracted content using the MapReduce scheme.
"Building a search engine for algorithms" by Suppawong Tuarob, Prasenjit Mitra, and C. Lee Giles with Martin Vesely as coordinator
An initial prototype of AlgorithmSeer is described, a system for extracting, indexing, and searching for algorithms in scholarly documents, and current issues and future directions, such as algorithm information extraction and classification, are discussed.
EMET : Extracting Metadata using ElementTree to Recommend Tags for Web
An algorithm to extract metadata using ElementTree (EMET) and a new search methodology that provides keyword recommendations for Web user content are presented; the proposed EMET algorithm yields an average of 0.934 precision, 0.3 recall, and 0.93 F-measure.
A simple taxonomy for computer science paper relationships
A simple taxonomy of relationships between research papers is proposed and it is shown how it can be used to improve retrieval of relevant papers, providing examples illustrating the potential benefits from its usage.
Reference metadata extraction from scientific papers
  • Zhixin Guo, Hai Jin
  • Computer Science
    2011 12th International Conference on Parallel and Distributed Computing, Applications and Technologies
  • 2011
A framework for automatic reference metadata extraction is described that can extract the title, author, journal, volume, year, and page from scientific papers in PDF.


A service-oriented architecture for digital libraries
The effort presented here towards the semantic integration of a complex information retrieval system could be used as an integration model for arbitrary systems.
Digital Libraries and Autonomous Citation Indexing
Digital libraries incorporating ACI can help organize scientific literature and may significantly improve the efficiency of dissemination and feedback and speed the transition to scholarly electronic publishing.
A framework for distributed digital object services
The following paper was written by the authors over a period of approximately 16 months during the period November 1993 to May 1995 in an attempt to explore a set of open research issues and to
Who gets acknowledged: Measuring scientific contributions through automatic acknowledgment indexing
It is argued that acknowledgments can be considered a metric parallel to citations in the academic audit process, and it is shown that combining acknowledgment analysis with citation indexing yields a measurable impact of the efficacy of various individuals as well as government, corporate, and university sponsors of scientific work.