PDFFigures 2.0: Mining figures from research papers

Christopher Clark and Santosh Kumar Divvala
2016 IEEE/ACM Joint Conference on Digital Libraries (JCDL)
Figures and tables are key sources of information in many scholarly documents. However, current academic search engines do not make use of figures and tables when semantically parsing documents or presenting document summaries to users. To facilitate these applications we develop an algorithm, called “PDFFigures 2.0,” that extracts figures, tables, and captions from documents. Our proposed approach analyzes the structure of individual pages by detecting captions, graphical elements, and chunks of…
Extracting Figures and Captions from Scientific Publications
This paper introduces a new and effective system for figure and caption extraction, PDFigCapX, which separates text from graphical content and utilizes layout information to detect and disambiguate figures and captions.
Data and text mining: Figure and caption extraction from biomedical documents
Motivation: Figures and captions convey essential information in biomedical documents. As such, there is a growing interest in mining published biomedical figures and in utilizing their respective…
Figure and caption extraction from biomedical documents
A new and effective system for figure and caption extraction, PDFigCapX, which first separates text from graphical content and then utilizes layout information to effectively detect and extract figures and captions.
FCENet: An Instance Segmentation Model for Extracting Figures and Captions From Material Documents
This study builds upon BlendMask, introducing a horizontal and vertical attention module and splitting the detection head into two branches, figure detection and caption detection, which increases final detection accuracy and speed.
A Machine Learning Approach for Semantic Structuring of Scientific Charts in Scholarly Documents
A machine-learning-based system that extracts and recognizes the various data fields present in a bar chart for semantic labeling, tested on a set of over 200 bar charts extracted from over 1,000 scientific articles in PDF format.
SCICAP: Generating Captions for Scientific Figures
  • Ting-Yao Hsu, C. Lee Giles, Ting-Hao ‘Kenneth’ Huang
  • 2021
Researchers use figures to communicate rich, complex information in scientific papers. The captions of these figures are critical to conveying effective messages. However, low-quality figure captions…
Convolutional Neural Networks for Figure Extraction in Historical Technical Documents
This work treats the extraction of figures and images from the pages of scanned documents as a computer vision problem, and trains convolutional neural networks to recognize figures in scanned pages, achieving precision and recall above 80% and transferring very well to historical scans.
Tab2Know: Building a Knowledge Base from Tables in Scientific Papers
A pipeline that employs both statistics-based classifiers and logic-based reasoning to build a Knowledge Base from tables in scientific papers; an empirical evaluation suggests that this is a promising step toward creating a large-scale KB of scientific knowledge.
Extracting Scientific Figures with Distantly Supervised Neural Networks
This paper induces high-quality training labels for the task of figure extraction in a large number of scientific documents, with no human intervention, and uses this dataset to train a deep neural network for end-to-end figure detection, yielding a model that can be more easily extended to new domains compared to previous work.
ScanBank: A Benchmark Dataset for Figure Extraction from Scanned Electronic Theses and Dissertations
ScanBank is a new dataset containing 10 thousand scanned page images, manually labeled for the presence of the 3.3 thousand figures or tables found therein. It is used to train a deep neural network model based on YOLOv5 to accurately extract figures and tables from scanned ETDs.


Looking Beyond Text: Extracting Figures, Tables and Captions from Computer Science Papers
This work introduces a new dataset of 150 computer science papers, along with ground truth labels for the locations of the figures, tables, and captions within them, and demonstrates a caption-to-figure matching component that is effective even in cases where individual captions are adjacent to multiple figures.
Automatic Extraction of Figures from Scholarly Documents
The challenges of how to build a heuristic-independent trainable model for such an extraction task and how to extract figures at scale are discussed, and three new evaluation metrics are defined: figure-precision, figure-recall, and figure-F1-score.
Understanding Charts in Research Papers: A Learning Approach
Academic research papers are a rich source of information, making them a prime target for computer knowledge mining. While much previous work has focused on reading text, papers often use graphical…
Logical Structure Recovery in Scholarly Articles with Rich Document Features
This work describes SectLabel, a module that further develops existing software to detect the logical structure of a document from existing PDF files, using the formalism of conditional random fields.
Automated Data Extraction from Scholarly Line Graphs
Line graphs are ubiquitous in scholarly papers. They are usually generated from a data table and often used to compare the performance of various methods. The data in these figures cannot be accessed. …
PDFMEF: A Multi-Entity Knowledge Extraction Framework for Scholarly Documents and Semantic Search
We introduce PDFMEF, a multi-entity knowledge extraction framework for scholarly documents in the PDF format. It is implemented with a framework that encapsulates open-source extraction tools. …
Curve separation for line graphs in scholarly documents
A system that extracts line graphs from scholarly PDFs as SVG images is reported, and it is shown how this can improve both the accuracy and the scalability of the curve separation problem.
Automatic Extraction of Figures from Scientific Publications in High-Energy Physics
This paper presents a novel solution for the initial problem of processing graphical content: obtaining figures from scholarly publications stored in PDF format. The solution depends on vector properties of documents and does not introduce the additional errors characteristic of methods based on raster image processing.
ParsCit: an Open-source CRF Reference String Parsing Package
This work describes ParsCit, a freely available, open-source implementation of a reference string parsing package that wraps a trained conditional random field model with added functionality to identify reference strings in a plain text file and to retrieve the citation contexts.
GROBID: Combining Automatic Bibliographic Data Recognition and Term Extraction for Scholarship Publications
Based on state-of-the-art machine learning techniques, GROBID (GeneRation Of BIbliographic Data) performs reliable bibliographic data extraction from scholarly articles combined with multi-level term…