A metadata and annotation extractor from PDF document for semantic web


Research scholars undertake a literature survey to identify a problem they would like to address and its possible solutions. As part of this activity, they download research papers from the internet, read them, and write comments, observations, explanations, or questions either on a separate sheet of paper or on the paper itself. They use these notes and observations to firm up their understanding of the research domain and to define their research problems. These notes and observations are a very valuable knowledge asset for research. My work is motivated by a desire to capture this knowledge and make it available to the community of research scholars, so that they can benefit from it. In this paper, I present an editor which facilitates authoring annotations on PDF documents. I have designed a DTD (Document Type Definition) for the annotation document. This DTD contains the identity of the annotation Author, the identity of the paper on which the annotation is created, and the Type, Comment, and Date_time elements. The Type field is an enumeration and takes one of the values "note", "comment", "insert", "help", or "paragraph". The value "insert" states that the annotation is not on the original PDF document but on another annotation. My tool provides a user-friendly interface to query these annotations on a PDF document, to classify documents by their number of comments, and to examine the relationships between annotations. My tool also extracts metadata from the PDF document. This metadata includes title, author, keywords, summary, and date_time. The tool is implemented in Java using the PDFBox API.
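The annotation record described above can be illustrated with a minimal Java sketch. It is not the paper's implementation: the abstract does not give the DTD itself, so the class name `AnnotationSketch`, the element spellings, the `id` and `target` attributes, and the rule that only an "insert" annotation names a target are all assumptions made for illustration. Only the five fields and the five type values come from the abstract.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch; names and XML layout are assumptions, not the paper's DTD.
final class AnnotationSketch {

    // The five annotation types listed in the abstract.
    enum Type { NOTE, COMMENT, INSERT, HELP, PARAGRAPH }

    static final class Annotation {
        final String id;        // assumed: lets an "insert" refer to another annotation
        final String author;
        final String paperId;
        final Type type;
        final String comment;
        final String dateTime;
        final String targetId;  // assumed: set only when type is INSERT

        Annotation(String id, String author, String paperId, Type type,
                   String comment, String dateTime, String targetId) {
            // Assumed rule: an "insert" must name the annotation it comments on,
            // and no other type may do so.
            if ((type == Type.INSERT) != (targetId != null)) {
                throw new IllegalArgumentException(
                    "exactly the \"insert\" annotations must name a target annotation");
            }
            this.id = id;
            this.author = author;
            this.paperId = paperId;
            this.type = type;
            this.comment = comment;
            this.dateTime = dateTime;
            this.targetId = targetId;
        }
    }

    // Escape the characters that are unsafe in XML text and attribute values.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    // Serialize one annotation to an XML fragment (layout is an assumption).
    static String toXml(Annotation a) {
        StringBuilder sb = new StringBuilder();
        sb.append("<Annotation id=\"").append(escape(a.id)).append('"');
        if (a.targetId != null) {
            sb.append(" target=\"").append(escape(a.targetId)).append('"');
        }
        sb.append(">\n");
        sb.append("  <Author>").append(escape(a.author)).append("</Author>\n");
        sb.append("  <Paper>").append(escape(a.paperId)).append("</Paper>\n");
        sb.append("  <Type>").append(a.type.name().toLowerCase(Locale.ROOT)).append("</Type>\n");
        sb.append("  <Comment>").append(escape(a.comment)).append("</Comment>\n");
        sb.append("  <Date_time>").append(escape(a.dateTime)).append("</Date_time>\n");
        sb.append("</Annotation>\n");
        return sb.toString();
    }

    // Count the annotations attached to one paper, e.g. to rank documents by comment volume.
    static long countForPaper(List<Annotation> all, String paperId) {
        return all.stream().filter(a -> a.paperId.equals(paperId)).count();
    }

    public static void main(String[] args) {
        // Illustrative values only.
        Annotation note = new Annotation("a1", "reader-1", "paper-1", Type.NOTE,
                "Is x < y assumed here?", "2010-01-01T10:00:00", null);
        Annotation reply = new Annotation("a2", "reader-2", "paper-1", Type.INSERT,
                "Yes, see the definition.", "2010-01-02T09:30:00", "a1");
        List<Annotation> all = Arrays.asList(note, reply);
        System.out.print(toXml(note));
        System.out.print(toXml(reply));
        System.out.println(countForPaper(all, "paper-1"));
    }
}
```

Modelling the type as an enum keeps invalid values out at compile time, mirroring what an enumerated type in a DTD enforces at validation time. Giving "insert" annotations an explicit target is one way to make the relationships between annotations queryable.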

DOI: 10.1145/1858378.1858426

Cite this paper

@inproceedings{Shukla2010AMA, title={A metadata and annotation extractor from PDF document for semantic web}, author={Archana Shukla}, booktitle={A2CWiC}, year={2010} }