Automatic Identification and Data Extraction from 2-Dimensional Plots in Digital Documents

William Brouwer, Saurabh Kataria, Sujatha Das Gollapalli, Prasenjit Mitra, C. Lee Giles
Most search engines index the textual content of documents in digital libraries. However, scholarly articles frequently report important findings in figures for visual impact, and the contents of these figures are not indexed. These contents are often invaluable to researchers in various fields for direct comparison with their own work. Therefore, searching for figures and extracting figure data are important problems. To the best of our knowledge, there exists no tool to…
Document Retrieval Using SIFT Image Features
A new approach to document classification based on visual features alone is described. Visual features substantially outperform text-based approaches for noisy text and capture enough of the documents' semantics to enable useful retrieval systems to be constructed.
Graphics Classification for Enterprise Knowledge Management
A machine learning approach is described that automatically classifies graphics within enterprise documents into an enterprise graphics taxonomy, enabling graphics search functionality to augment traditional document-centric enterprise search.
Document retrieval using image features
It is shown that using visual features substantially outperforms text-based approaches for noisy text, giving average precision in the range 0.4–0.43 in several experiments retrieving scientific papers.


Automatic Extraction of Data from 2-D Plots in Documents
This work proposes an automated algorithm for extracting information from line curves in 2-D plots that can be stored in a database and indexed to answer end-user queries and enhance search results.
Automatic categorization of figures in scientific documents
A machine-learning-based approach for automatic categorization of figures based on their functionalities in scholarly articles is developed and can be integrated into a scientific-document digital library.
CiteSeer: an automatic citation indexing system
CiteSeer has many advantages over traditional citation indexes, including the ability to create more up-to-date databases which are not limited to a preselected set of journals or restricted by journal publication delays, completely autonomous operation with a corresponding reduction in cost, and powerful interactive browsing of the literature using the context of citations.
Use of the Hough transformation to detect lines and curves in pictures
It is pointed out that the use of angle-radius rather than slope-intercept parameters simplifies the computation further, and it is shown how the method can be used for more general curve fitting.
Image Processing by Simulated Annealing
It is shown that simulated annealing, a statistical mechanics method recently proposed as a tool for solving complex optimization problems, can be used in problems arising in image processing, and that some of these problems are formally equivalent to ground state problems for two-dimensional Ising spin systems.
Context-based multiscale classification of document images using wavelet coefficient distributions
Jia Li, R. Gray. Mathematics, Computer Science. IEEE Trans. Image Process., 2000.
An algorithm is developed for segmenting document images into four classes (background, photograph, text, and graph) based on the distribution patterns of wavelet coefficients in high frequency bands, enabling accurate classification at class boundaries as well as fast classification overall.
Practical Algorithms for Image Analysis: Description, Examples, and Code
A Tutorial on Support Vector Machines for Pattern Recognition
The tutorial starts with an overview of the concepts of VC dimension and structural risk minimization. We then describe linear Support Vector Machines (SVMs) for separable and non-separable data, w...