CiteSeerx: an architecture and web service design for an academic document search engine

Authors: Huajing Li, Isaac G. Councill, Wang-Chien Lee, C. Lee Giles
Published in: WWW '06
CiteSeer is a scientific literature digital library and search engine that automatically crawls and indexes scientific documents in the field of computer and information science. After serving as a public search engine for nearly ten years, CiteSeer is starting to have scaling problems in handling more documents, adding new features, and serving more users. Its monolithic architecture prevents it from making effective use of new web technologies and providing new services. After analyzing…

SeerSuite: Developing a Scalable and Reliable Application Framework for Building Digital Libraries by Crawling the Web
The design of SeerSuite is described, along with the deployment and usage of CiteSeerx as an instance of SeerSuite, which enables access to extensive document, citation, and author metadata by automatically extracting, storing, and indexing metadata.
Computational Issues in Digital Library Search Engines
Details are presented of a scalable and portable system that is built on a message-oriented middleware architecture with a publish/subscribe approach and can be deployed across different physical and cloud infrastructures.
An entity profile schema for data integration in an academic metasearch engine
This work proposes an approach to the entity profile integration problem based on the selection of profiles with shared attributes and shows that the proposed approach enhances the performance of the metasearch engine.
Building a Search Engine for Algorithms
An initial prototype of AlgorithmSeer, a system for extracting, indexing, and searching for algorithms in scholarly documents, is described, which can search through a large collection of scholarly documents or author homepages.
Efficient Exploration of Algorithms in Scholarly Documents Using Big Data Analytics
This paper proposes a method to develop an algorithm search engine that analyzes a document to discover any algorithm it may contain and extracts additional information about that algorithm.
AlgorithmSeer: A System for Extracting and Searching for Algorithms in Scholarly Big Data
A novel set of scalable techniques used by AlgorithmSeer to identify and extract algorithm representations from a heterogeneous pool of scholarly documents is proposed, along with hybrid machine learning approaches to discover algorithm representations.
Global citation recommendation using knowledge graphs
This work focuses on a setting where the user provides only the abstract of a new paper as input, and proposes a model to expand the semantic features of the given abstract using knowledge graphs and combine them with other features to fit a learning to rank model.
BibMiner: A service-oriented framework for bibliographic analysis service
  • Bin Wu, Hongqiao Tian
  • Computer Science
    2010 2nd IEEE International Conference on Network Infrastructure and Digital Content
  • 2010
This paper describes a prototype of BibMiner, a system still under construction, for uncovering hidden knowledge in large-scale literature records using a combination of service-oriented approaches and complex network theory.
A Novel Parallel Architecture Design of Information Retrieval System for Scientific Papers
This work proposes SciPDFindexer, a distributed information retrieval system for scientific articles in PDF, which parses and extracts metadata from articles and then indexes the extracted content using the MapReduce scheme.
"Building a search engine for algorithms" by Suppawong Tuarob, Prasenjit Mitra, and C. Lee Giles with Martin Vesely as coordinator
An initial prototype of AlgorithmSeer is described, a system for extracting, indexing, and searching for algorithms in scholarly documents, and current issues and future directions, such as algorithm information extraction and classification, are discussed.


A service-oriented architecture for digital libraries
The effort presented here towards the semantic integration of a complex information retrieval system could be used as an integration model for arbitrary systems.
Digital Libraries and Autonomous Citation Indexing
Digital libraries incorporating ACI can help organize scientific literature and may significantly improve the efficiency of dissemination and feedback and speed the transition to scholarly electronic publishing.
A framework for distributed digital object services
The following paper was written by the authors over a period of approximately 16 months during the period November 1993 to May 1995 in an attempt to explore a set of open research issues and to
Who gets acknowledged: Measuring scientific contributions through automatic acknowledgment indexing
It is argued that acknowledgments can be considered a metric parallel to citations in the academic audit process, and it is shown that combining acknowledgment analysis with citation indexing yields a measurable impact of the efficacy of various individuals as well as government, corporate, and university sponsors of scientific work.