• Corpus ID: 16285667

Looking Beyond Text: Extracting Figures, Tables and Captions from Computer Science Papers

Christopher Clark and Santosh Kumar Divvala. AAAI Workshop: Scholarly Big Data.
Identifying and extracting figures and tables along with their captions from scholarly articles is important both as a way of providing tools for article summarization and as part of larger systems that seek to gain a deeper, semantic understanding of these articles. [...] This method can extract a wide variety of figures because it does not make strong assumptions about the format of the figures embedded in the document, as long as they can be differentiated from the main article's text.

Automatic Extraction of Figures from Scholarly Documents

The challenges of building a heuristic-independent, trainable model for such an extraction task and of extracting figures at scale are discussed, and three new evaluation metrics are defined: figure-precision, figure-recall, and figure-F1-score.

Figure and caption extraction from biomedical documents

This work introduces a new and effective system for figure and caption extraction, PDFigCapX, which first separates text from graphical content and then uses layout information to detect and extract figures and captions.

Scalable algorithms for scholarly figure mining and semantics

Most scholarly papers contain one or more figures. Often these figures show experimental results, e.g., line graphs are used to compare various methods. Compared to the text of the paper, figures

FigureSeer: Parsing Result-Figures in Research Papers

This paper introduces FigureSeer, an end-to-end framework for parsing result-figures, that enables powerful search and retrieval of results in research papers and formulates a novel graph-based reasoning approach using a CNN-based similarity metric.

Line-items and table understanding in structured documents

This work presents a comprehensive representation of a page using a graph over word boxes, positional embeddings, and trainable textual features, and proposes a novel neural network model that achieves strong, practical results on the presented dataset.

Prediction of importance of figures in scholarly papers

  • Yui Kita, J. Rekimoto
  • Education
    2017 Twelfth International Conference on Digital Information Management (ICDIM)
  • 2017
This paper shows that the importance of a figure in scholarly papers can be predicted by a machine learning technique based on a comparison of the sizes, page numbers or color features of the figures.

Table Understanding in Structured Documents

This work presents a comprehensive representation of a page using a graph over word boxes, positional embeddings, and trainable textual features, and proposes a novel neural network model that achieves strong, practical results on the presented dataset.

SideNoter: Scholarly Paper Browsing System based on PDF Restructuring and Text Annotation

This system provides ways to extract natural language sentences from PDF files together with their logical structures, and to map arbitrary textual spans to their corresponding regions on page images. It is planned to be made widely available to NLP researchers.

Understanding Charts in Research Papers: A Learning Approach

The goal is to understand figures in research papers by parsing them into a structured, computer-readable representation by fully automating the pipeline from input papers to output results, by allowing reading of multiple variables plotted on the same axis, and by introducing a quantitative metric for evaluating data extraction.

Convolutional Neural Networks for Figure Extraction in Historical Technical Documents

This work treats the extraction of figures and images from the pages of scanned documents as a computer vision problem. It trains convolutional neural networks to recognize figures in scanned pages, achieving precision and recall above 80% and transferring very well to historical scans.



Figure Metadata Extraction from Digital Documents

This work describes the very first step in indexing, classification and data extraction from figures in PDF documents - accurate automatic extraction of figures and associated metadata, a nontrivial task.

An Automatic System for Extracting Figures and Captions in Biomedical PDF Documents

This paper presents an automatic system for robustly harvesting figures from biomedical literature, based on the idea that the PDF specification of the document layout can be used to identify encoded figures and figure boundaries within the PDF and enforce constraints among figure-regions.

On methods and tools of table detection, extraction and annotation in PDF documents

This work investigates the state of the art in table detection, extraction and annotation in PDF documents, finding very limited attention towards these aspects in books, especially books in PDF format.

Automatic Extraction of Figures from Scientific Publications in High-Energy Physics

This paper presents a novel solution to the initial problem of processing graphical content: obtaining figures from scholarly publications stored in PDF format. The method depends on the vector properties of documents and does not introduce the additional errors characteristic of methods based on raster image processing.

Yale Image Finder (YIF): a new search engine for retrieving biomedical images

YIF is a publicly accessible search engine featuring a new way of retrieving biomedical images and associated papers based on the text carried inside the images, allowing users to find related papers starting with an image of interest.

A survey of table recognition

This presentation clarifies both the decisions made by a table recognizer and the assumptions and inferencing techniques that underlie these decisions.

Large Graph Construction for Scalable Semi-Supervised Learning

This paper addresses the scalability issue plaguing graph-based semi-supervised learning via a small number of anchor points which adequately cover the entire point cloud via a unique idea called AnchorGraph which provides nonnegative adjacency matrices to guarantee positive semidefinite graph Laplacians.

The Pascal Visual Object Classes (VOC) Challenge

The state of the art in evaluated methods for both classification and detection is reviewed, including whether the methods are statistically different, what they are learning from the images, and what they find easy or confuse.

An Overview of the Tesseract OCR Engine

  • R. Smith
  • Computer Science
    Ninth International Conference on Document Analysis and Recognition (ICDAR 2007)
  • 2007
The Tesseract OCR engine, as was the HP Research Prototype in the UNLV Fourth Annual Test of OCR Accuracy, is described in a comprehensive overview. Emphasis is placed on aspects that are novel or at

False-Name Manipulations in Weighted Voting Games

This paper investigates by how much a player can change his power, as measured by the Shapley-Shubik index or the Banzhaf index, by means of a false-name manipulation, i.e., splitting his weight among two or more identities.