Corpus ID: 16285667

Looking Beyond Text: Extracting Figures, Tables and Captions from Computer Science Papers

@inproceedings{Clark2015LookingBT,
  title={Looking Beyond Text: Extracting Figures, Tables and Captions from Computer Science Papers},
  author={Christopher Clark and S. Divvala},
  booktitle={AAAI Workshop: Scholarly Big Data},
  year={2015}
}
Identifying and extracting figures and tables along with their captions from scholarly articles is important both as a way of providing tools for article summarization, and as part of larger systems that seek to gain deeper, semantic understanding of these articles. [...] This method can extract a wide variety of figures because it does not make strong assumptions about the format of the figures embedded in the document, as long as they can be differentiated from the main article's text.
PDFFigures 2.0: Mining figures from research papers
TLDR
An algorithm called “PDFFigures 2.0” that extracts figures, tables, and captions from documents by analyzing the structure of individual pages: it detects captions, graphical elements, and chunks of body text, and then locates figures and tables by reasoning about the empty regions within that text.
Automatic Extraction of Figures from Scholarly Documents
TLDR
The challenges of how to build a heuristic-independent trainable model for such an extraction task and how to extract figures at scale are discussed, and three new evaluation metrics are defined: figure-precision, figure-recall, and figure-F1-score.
Data and text mining Figure and caption extraction from biomedical documents
Motivation: Figures and captions convey essential information in biomedical documents. As such, there is a growing interest in mining published biomedical figures and in utilizing their respective [...]
Figure and caption extraction from biomedical documents
TLDR
A new and effective system for figure and caption extraction, PDFigCapX, which first separates text from graphical content and then uses layout information to detect and extract figures and captions.
A Machine Learning Approach for Semantic Structuring of Scientific Charts in Scholarly Documents
TLDR
A machine learning based system that extracts and recognizes the various data fields present in a bar chart for semantic labeling and is tested on a set of over 200 bar charts extracted from over 1,000 scientific articles in PDF format.
Scalable algorithms for scholarly figure mining and semantics
Most scholarly papers contain one or multiple figures. Often these figures show experimental results, e.g., line graphs are used to compare various methods. Compared to the text of the paper, figures [...]
FigureSeer: Parsing Result-Figures in Research Papers
TLDR
This paper introduces FigureSeer, an end-to-end framework for parsing result-figures, that enables powerful search and retrieval of results in research papers and formulates a novel graph-based reasoning approach using a CNN-based similarity metric.
Line-items and table understanding in structured documents
TLDR
This work presents a comprehensive representation of a page using a graph over word boxes, positional embeddings, and trainable textual features, and proposes a novel neural network model that achieves strong, practical results on the presented dataset.
Prediction of importance of figures in scholarly papers
  • Yui Kita, J. Rekimoto
  • Computer Science
  • 2017 Twelfth International Conference on Digital Information Management (ICDIM)
  • 2017
TLDR
This paper shows that the importance of a figure in scholarly papers can be predicted by a machine learning technique based on a comparison of the sizes, page numbers or color features of the figures.
Table Understanding in Structured Documents
TLDR
This work presents a comprehensive representation of a page using a graph over word boxes, positional embeddings, and trainable textual features, and proposes a novel neural network model that achieves strong, practical results on the presented dataset.

References

Figure Metadata Extraction from Digital Documents
TLDR
This work describes the very first step in indexing, classification and data extraction from figures in PDF documents - accurate automatic extraction of figures and associated metadata, a nontrivial task.
An Automatic System for Extracting Figures and Captions in Biomedical PDF Documents
TLDR
This paper presents an automatic system for robustly harvesting figures from biomedical literature, based on the idea that the PDF specification of the document layout can be used to identify encoded figures and figure boundaries within the PDF and enforce constraints among figure-regions.
On methods and tools of table detection, extraction and annotation in PDF documents
TLDR
This work investigates the state of the art in table detection, extraction and annotation in PDF documents, finding very limited attention towards these aspects in books, especially books in PDF format.
Automatic Extraction of Figures from Scientific Publications in High-Energy Physics
TLDR
This paper presents a novel solution for the initial problem of processing graphical content: obtaining figures from scholarly publications stored in PDF format. The solution depends on vector properties of documents and does not introduce the additional errors characteristic of methods based on raster image processing.
CiteSeerX: AI in a Digital Library Search Engine
TLDR
This work presents key AI technologies used in the following components: document classification and de-duplication, document and citation clustering, automatic metadata extraction and indexing, and author disambiguation in CiteSeerX.
Yale Image Finder (YIF): a new search engine for retrieving biomedical images
TLDR
YIF is a publicly accessible search engine featuring a new way of retrieving biomedical images and associated papers based on the text carried inside the images, allowing users to find related papers starting with an image of interest.
A survey of table recognition
TLDR
This presentation clarifies both the decisions made by a table recognizer and the assumptions and inferencing techniques that underlie these decisions.
Large Graph Construction for Scalable Semi-Supervised Learning
TLDR
This paper addresses the scalability issue plaguing graph-based semi-supervised learning by using a small number of anchor points that adequately cover the entire point cloud. The resulting construction, called AnchorGraph, provides nonnegative adjacency matrices that guarantee positive semidefinite graph Laplacians.
The Pascal Visual Object Classes (VOC) Challenge
TLDR
The state of the art in evaluated methods for both classification and detection is reviewed, including whether the methods are statistically different, what they are learning from the images, and what the methods find easy or confuse.
The ACL Anthology Reference Corpus: A Reference Dataset for Bibliographic Research in Computational Linguistics
TLDR
This is a post-print of a paper from Sixth International Conference on Language Resources and Evaluation 2008, where six papers were presented, one of which was new to the literature.