PDFFigures 2.0: Mining figures from research papers

@inproceedings{Clark2016PDFFigures2M,
  title={PDFFigures 2.0: Mining figures from research papers},
  author={Christopher Clark and Santosh Kumar Divvala},
  booktitle={2016 IEEE/ACM Joint Conference on Digital Libraries (JCDL)},
  year={2016},
  pages={143--152}
}
Figures and tables are key sources of information in many scholarly documents. However, current academic search engines do not make use of figures and tables when semantically parsing documents or presenting document summaries to users. To facilitate these applications, we develop an algorithm called “PDFFigures 2.0” that extracts figures, tables, and captions from documents. Our proposed approach analyzes the structure of individual pages by detecting captions, graphical elements, and chunks of…

Extracting Figures and Captions from Scientific Publications

TLDR
This paper introduces a new and effective system for figure and caption extraction, PDFigCapX, which separates text from graphical content and utilizes layout information to detect and disambiguate figures and captions.

Parsing AUC Result-Figures in Machine Learning Specific Scholarly Documents for Semantically-enriched Summarization

TLDR
This paper proposes creating semantically enriched document summaries by extracting meaningful data from the results-figures specific to the evaluation metric of the area under the curve (AUC) and their associated captions from full-text documents, and observes that figure-specialized summaries are more comprehensive and semantically enriched.

Data and text mining Figure and caption extraction from biomedical documents

TLDR
This work introduces a new and effective system for figure and caption extraction, PDFigCapX, which first separates text from graphical content and then utilizes layout information to effectively detect and extract figures and captions.

Figure and caption extraction from biomedical documents

TLDR
A new and effective system for figure and caption extraction, PDFigCapX, which first separates text from graphical content and then utilizes layout information to effectively detect and extract figures and captions.

FCENet: An Instance Segmentation Model for Extracting Figures and Captions From Material Documents

TLDR
This study builds upon BlendMask, introduces a horizontal and vertical attention module, and splits the BlendMask detection head into two branches (figure detection and caption detection), which increases final detection accuracy and speed.

A Machine Learning Approach for Semantic Structuring of Scientific Charts in Scholarly Documents

TLDR
A machine-learning-based system that extracts and recognizes the various data fields present in a bar chart for semantic labeling, tested on a set of over 200 bar charts extracted from over 1,000 scientific articles in PDF format.

Convolutional Neural Networks for Figure Extraction in Historical Technical Documents

TLDR
This work treats the extraction of figures and images from the pages of scanned documents as a computer vision problem, and trains convolutional neural networks to recognize figures in scanned pages; the networks achieve precision and recall above 80% and transfer very well to historical scans.

Tab2Know: Building a Knowledge Base from Tables in Scientific Papers

TLDR
A pipeline that employs both statistical-based classifiers and logic-based reasoning to build a Knowledge Base from tables in scientific papers, and an empirical evaluation suggests that this is a promising step to create a large-scale KB of scientific knowledge.

Extracting Scientific Figures with Distantly Supervised Neural Networks

TLDR
This paper induces high-quality training labels for the task of figure extraction in a large number of scientific documents, with no human intervention, and uses this dataset to train a deep neural network for end-to-end figure detection, yielding a model that can be more easily extended to new domains compared to previous work.

ChartText: Linking Text with Charts in Documents

TLDR
This work presents ChartText, a method that links text with visualizations. It supports documents that include bar charts, line charts, and scatter plots, and can automatically annotate a chart following the presenter’s description.
...

References

Looking Beyond Text: Extracting Figures, Tables and Captions from Computer Science Papers

TLDR
This work introduces a new dataset of 150 computer science papers along with ground truth labels for the locations of the figures, tables and captions within them and demonstrates a caption-to-figure matching component that is effective even in cases where individual captions are adjacent to multiple figures.

Automatic Extraction of Figures from Scholarly Documents

TLDR
The challenges of how to build a heuristic-independent trainable model for such an extraction task and how to extract figures at scale are discussed, and three new evaluation metrics are defined: figure-precision, figure-recall, and figure-F1-score.

Understanding Charts in Research Papers: A Learning Approach

TLDR
The goal is to understand figures in research papers by parsing them into a structured, computer-readable representation by fully automating the pipeline from input papers to output results, by allowing reading of multiple variables plotted on the same axis, and by introducing a quantitative metric for evaluating data extraction.

Logical Structure Recovery in Scholarly Articles with Rich Document Features

TLDR
SectLabel is described, a module that further develops existing software to detect the logical structure of a document from existing PDF files, using the formalism of conditional random fields.

Automated Data Extraction from Scholarly Line Graphs

TLDR
An analysis of line graphs is reported to explain the challenges of building a fully automated data extraction system and a novel curve extraction method is proposed that has an average accuracy of 82%.

PDFMEF: A Multi-Entity Knowledge Extraction Framework for Scholarly Documents and Semantic Search

We introduce PDFMEF, a multi-entity knowledge extraction framework for scholarly documents in the PDF format. It is implemented with a framework that encapsulates open-source extraction tools.

Curve separation for line graphs in scholarly documents

TLDR
A system to extract line graphs from scholarly PDFs as SVG images is reported and it is shown how that can improve both the accuracy and the scalability of the curve separation problem.

ParsCit: an Open-source CRF Reference String Parsing Package

TLDR
ParsCit is described: a freely available, open-source reference string parsing package that wraps a trained conditional random field model with added functionality to identify reference strings in a plain text file and to retrieve the citation contexts.

GROBID: Combining Automatic Bibliographic Data Recognition and Term Extraction for Scholarship Publications

Based on state-of-the-art machine learning techniques, GROBID (GeneRation Of BIbliographic Data) performs reliable bibliographic data extractions from scholarly articles combined with multi-level term…

Spark: Cluster Computing with Working Sets

TLDR
Spark can outperform Hadoop by 10x in iterative machine learning jobs, and can be used to interactively query a 39 GB dataset with sub-second response time.