• Corpus ID: 16285667

Looking Beyond Text: Extracting Figures, Tables and Captions from Computer Science Papers

@inproceedings{Clark2015LookingBT,
  title={Looking Beyond Text: Extracting Figures, Tables and Captions from Computer Science Papers},
  author={Christopher Clark and Santosh Kumar Divvala},
  booktitle={AAAI Workshop: Scholarly Big Data},
  year={2015}
}
Identifying and extracting figures and tables along with their captions from scholarly articles is important both as a way of providing tools for article summarization, and as part of larger systems that seek to gain deeper, semantic understanding of these articles. [...]

Key Method: This method can extract a wide variety of figures because it does not make strong assumptions about the format of the figures embedded in the document, as long as they can be differentiated from the main article's text.

PDFFigures 2.0: Mining figures from research papers

TLDR
An algorithm called “PDFFigures 2.0” that extracts figures, tables, and captions from documents; it analyzes the structure of individual pages by detecting captions, graphical elements, and chunks of body text, and then locates figures and tables by reasoning about the empty regions within that text.

Automatic Extraction of Figures from Scholarly Documents

TLDR
The challenges of building a heuristic-independent trainable model for such an extraction task and of extracting figures at scale are discussed, and three new evaluation metrics are defined: figure-precision, figure-recall, and figure-F1-score.

Data and text mining Figure and caption extraction from biomedical documents

TLDR
This work introduces a new and effective system for figure and caption extraction, PDFigCapX, which first separates text from graphical content and then uses layout information to effectively detect and extract figures and captions.

Figure and caption extraction from biomedical documents

TLDR
A new and effective system for figure and caption extraction, PDFigCapX, which first separates text from graphical content and then uses layout information to effectively detect and extract figures and captions.

A Machine Learning Approach for Semantic Structuring of Scientific Charts in Scholarly Documents

TLDR
A machine learning based system that extracts and recognizes the various data fields present in a bar chart for semantic labeling and is tested on a set of over 200 bar charts extracted from over 1,000 scientific articles in PDF format.

Scalable algorithms for scholarly figure mining and semantics

Most scholarly papers contain one or multiple figures. Often these figures show experimental results, e.g., line graphs are used to compare various methods. Compared to the text of the paper, figures

FigureSeer: Parsing Result-Figures in Research Papers

TLDR
This paper introduces FigureSeer, an end-to-end framework for parsing result-figures, that enables powerful search and retrieval of results in research papers and formulates a novel graph-based reasoning approach using a CNN-based similarity metric.

Line-items and table understanding in structured documents

TLDR
This work presents a comprehensive representation of a page using a graph over word boxes, positional embeddings, and trainable textual features, and proposes a novel neural network model that achieves strong, practical results on the presented dataset.

Prediction of importance of figures in scholarly papers

  • Yui Kita, J. Rekimoto
  • Education
    2017 Twelfth International Conference on Digital Information Management (ICDIM)
  • 2017
TLDR
This paper shows that the importance of a figure in scholarly papers can be predicted by a machine learning technique based on a comparison of the sizes, page numbers or color features of the figures.

Table Understanding in Structured Documents

TLDR
This work presents a comprehensive representation of a page using a graph over word boxes, positional embeddings, and trainable textual features, and proposes a novel neural network model that achieves strong, practical results on the presented dataset.
...

References

Figure Metadata Extraction from Digital Documents

TLDR
This work describes the very first step in indexing, classification and data extraction from figures in PDF documents - accurate automatic extraction of figures and associated metadata, a nontrivial task.

An Automatic System for Extracting Figures and Captions in Biomedical PDF Documents

TLDR
This paper presents an automatic system for robustly harvesting figures from biomedical literature, based on the idea that the PDF specification of the document layout can be used to identify encoded figures and figure boundaries within the PDF and enforce constraints among figure-regions.

On methods and tools of table detection, extraction and annotation in PDF documents

TLDR
This work investigates the state of the art in table detection, extraction and annotation in PDF documents, finding very limited attention towards these aspects in books, especially books in PDF format.

CiteSeerX: AI in a Digital Library Search Engine

TLDR
This work presents key AI technologies used in the following components of CiteSeerX: document classification and deduplication, document and citation clustering, automatic metadata extraction and indexing, and author disambiguation.

Yale Image Finder (YIF): a new search engine for retrieving biomedical images

TLDR
YIF is a publicly accessible search engine featuring a new way of retrieving biomedical images and associated papers based on the text carried inside the images, allowing users to find related papers starting with an image of interest.

A survey of table recognition

TLDR
This presentation clarifies both the decisions made by a table recognizer and the assumptions and inferencing techniques that underlie these decisions.

The ACL Anthology Reference Corpus: A Reference Dataset for Bibliographic Research in Computational Linguistics

TLDR
This is a post-print of a paper from Sixth International Conference on Language Resources and Evaluation 2008, where six papers were presented, one of which was new to the literature.

Large Graph Construction for Scalable Semi-Supervised Learning

TLDR
This paper addresses the scalability issue plaguing graph-based semi-supervised learning via a small number of anchor points which adequately cover the entire point cloud via a unique idea called AnchorGraph which provides nonnegative adjacency matrices to guarantee positive semidefinite graph Laplacians.

The Pascal Visual Object Classes (VOC) Challenge

TLDR
The state of the art in evaluated methods for both classification and detection is reviewed, including whether the methods are statistically different, what they are learning from the images, and what the methods find easy or confuse.

An Overview of the Tesseract OCR Engine

  • R. Smith
  • Computer Science
    Ninth International Conference on Document Analysis and Recognition (ICDAR 2007)
  • 2007
The Tesseract OCR engine, as was the HP Research Prototype in the UNLV Fourth Annual Test of OCR Accuracy, is described in a comprehensive overview. Emphasis is placed on aspects that are novel or at