PDFFigures 2.0: Mining figures from research papers

Christopher Clark and Santosh Kumar Divvala. 2016 IEEE/ACM Joint Conference on Digital Libraries (JCDL).
Figures and tables are key sources of information in many scholarly documents. However, current academic search engines do not make use of figures and tables when semantically parsing documents or presenting document summaries to users. To facilitate these applications, we develop an algorithm, "PDFFigures 2.0," that extracts figures, tables, and captions from documents. Our proposed approach analyzes the structure of individual pages by detecting captions, graphical elements, and chunks of…

Extracting Figures and Captions from Scientific Publications

This paper introduces a new and effective system for figure and caption extraction, PDFigCapX, which separates text from graphical content and utilizes layout information to detect and disambiguate figures and captions.

Parsing AUC Result-Figures in Machine Learning Specific Scholarly Documents for Semantically-enriched Summarization

This paper proposes creating semantically enriched document summaries by extracting meaningful data from results-figures specific to the area-under-the-curve (AUC) evaluation metric, along with their associated captions, from full-text documents, and observes that figure-specialized summaries are more comprehensive and semantically enriched.

Figure and Caption Extraction from Biomedical Documents

This work introduces a new and effective system for figure and caption extraction, PDFigCapX, which first separates text from graphical content and then utilizes layout information to effectively detect and extract figures and captions.

FCENet: An Instance Segmentation Model for Extracting Figures and Captions From Material Documents

This study builds upon BlendMask, splitting its detection head into two branches (figure detection and caption detection) to increase final detection accuracy and speed, and introduces a horizontal and vertical attention module.

A Machine Learning Approach for Semantic Structuring of Scientific Charts in Scholarly Documents

A machine learning based system that extracts and recognizes the various data fields present in a bar chart for semantic labeling and is tested on a set of over 200 bar charts extracted from over 1,000 scientific articles in PDF format.

Convolutional Neural Networks for Figure Extraction in Historical Technical Documents

This work treats the extraction of figures and images from the pages of scanned documents as a computer vision problem, and trains convolutional neural networks to recognize figures in scanned pages to achieve precision and recall above 80% and transfer very well to historical scans.

Tab2Know: Building a Knowledge Base from Tables in Scientific Papers

A pipeline that employs both statistical-based classifiers and logic-based reasoning to build a Knowledge Base from tables in scientific papers, and an empirical evaluation suggests that this is a promising step to create a large-scale KB of scientific knowledge.

Extracting Scientific Figures with Distantly Supervised Neural Networks

This paper induces high-quality training labels for the task of figure extraction in a large number of scientific documents, with no human intervention, and uses this dataset to train a deep neural network for end-to-end figure detection, yielding a model that can be more easily extended to new domains compared to previous work.

ChartText: Linking Text with Charts in Documents

ChartText, a method that links text with visualizations, is presented; it supports documents that include bar charts, line charts, and scatter plots, and can automatically annotate a chart following the presenter's description.

Automatic Extraction of Figures from Scholarly Documents

The challenges of building a heuristic-independent trainable model for this extraction task and of extracting figures at scale are discussed, and three new evaluation metrics are defined: figure-precision, figure-recall, and figure-F1-score.

Understanding Charts in Research Papers: A Learning Approach

The goal is to understand figures in research papers by parsing them into a structured, computer-readable representation by fully automating the pipeline from input papers to output results, by allowing reading of multiple variables plotted on the same axis, and by introducing a quantitative metric for evaluating data extraction.

Logical Structure Recovery in Scholarly Articles with Rich Document Features

SectLabel is described, a module that further develops existing software to detect the logical structure of a document from existing PDF files, using the formalism of conditional random fields.

Automated Data Extraction from Scholarly Line Graphs

An analysis of line graphs is reported to explain the challenges of building a fully automated data extraction system and a novel curve extraction method is proposed that has an average accuracy of 82%.

PDFMEF: A Multi-Entity Knowledge Extraction Framework for Scholarly Documents and Semantic Search

We introduce PDFMEF, a multi-entity knowledge extraction framework for scholarly documents in the PDF format, implemented as a framework that encapsulates open-source extraction tools.

Curve separation for line graphs in scholarly documents

A system to extract line graphs from scholarly PDFs as SVG images is reported, and it is shown how this can improve both the accuracy and the scalability of the curve separation problem.

Automatic Extraction of Figures from Scientific Publications in High-Energy Physics

This paper presents a novel solution to the initial problem of processing graphical content, obtaining figures from scholarly publications stored in PDF format. The method depends on the vector properties of documents and does not introduce the additional errors characteristic of methods based on raster image processing.

ParsCit: an Open-source CRF Reference String Parsing Package

ParsCit, a freely available, open-source reference string parsing package, is described; it wraps a trained conditional random field model with added functionality to identify reference strings in a plain text file and to retrieve their citation contexts.

GROBID: Combining Automatic Bibliographic Data Recognition and Term Extraction for Scholarship Publications

Based on state-of-the-art machine learning techniques, GROBID (GeneRation Of BIbliographic Data) performs reliable bibliographic data extraction from scholarly articles, combined with multi-level term extraction.

Text detection in screen images with a Convolutional Neural Network

The repository contains a set of scripts that implement text detection in screen images, using a convolutional neural network to predict a heatmap of the probability of text at each location in an image.