Extracting Scientific Figures with Distantly Supervised Neural Networks

Noah Siegel, Nicholas Lourie, Russell Power, Waleed Ammar
Proceedings of the 18th ACM/IEEE Joint Conference on Digital Libraries
Non-textual components such as charts, diagrams, and tables provide key information in many scientific documents, but the lack of large labeled datasets has impeded the development of data-driven methods for scientific figure extraction. […] We share the resulting dataset of over 5.5 million induced labels, 4,000 times larger than the previous largest figure extraction dataset, with an average precision of 96.8%, to enable the development of modern data-driven methods for this task.
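The abstract's "induced labels" come from distant supervision: rather than hand-annotating figures, labels are derived automatically by aligning information in a document's source markup with its rendered pages. A minimal sketch of that idea, with all names and the caption-matching heuristic purely illustrative (not the paper's actual pipeline):

```python
# Hedged sketch: distant supervision can induce figure labels by matching
# caption text recovered from document source markup (e.g. LaTeX \caption)
# against text extracted from each rendered page. Unambiguous matches
# become training labels with no manual annotation.

def induce_page_labels(caption_texts, pages):
    """Map each caption to the page whose extracted text contains it.

    caption_texts: captions recovered from the source markup.
    pages: list of per-page extracted text strings.
    Returns {caption: page_index} for captions found on exactly one page.
    """
    labels = {}
    for caption in caption_texts:
        hits = [i for i, text in enumerate(pages) if caption in text]
        if len(hits) == 1:  # keep only unambiguous matches as induced labels
            labels[caption] = hits[0]
    return labels

pages = [
    "Introduction ... related work ...",
    "Results. Figure 2: Precision-recall curves for all models.",
    "Figure 3: Example extractions. Conclusion ...",
]
captions = [
    "Figure 2: Precision-recall curves for all models.",
    "Figure 3: Example extractions.",
]
labels = induce_page_labels(captions, pages)
```

Because the matching is automatic, some induced labels are noisy, which is why the dataset is reported with a measured average precision (96.8%) rather than assumed to be perfect.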


MexPub: Deep Transfer Learning for Metadata Extraction from German Publications

This paper presents a method that extracts metadata from PDF documents with different layouts and styles by viewing the document as an image, and achieves an average accuracy of around 90%, which validates its capability to accurately extract metadata from a variety of PDF documents with challenging templates.

Scientific evidence extraction

A new dataset, PubMed Tables One Million (PubTables-1M), and a new class of metric, grid table similarity (GriTS), are proposed; the metric can be used to evaluate models across multiple architectures and modalities, and the dataset addresses issues such as ambiguity and lack of consistency in annotations.

Parsing AUC Result-Figures in Machine Learning Specific Scholarly Documents for Semantically-enriched Summarization

This paper proposes creating semantically enriched document summaries by extracting meaningful data from results-figures specific to the area under the curve (AUC) evaluation metric, along with their associated captions, from full-text documents, and observes that figure-specialized summaries are more comprehensive and semantically enriched.

PubTables-1M: Towards a universal dataset and metrics for training and evaluating table extraction models

This paper introduces PubMed Tables One Million (PubTables-1M), a new dataset that is nearly twice as large as the previous largest comparable dataset and contains highly detailed structure annotations, along with a new class of metric, grid table similarity (GriTS), which can be used for models across multiple architectures and modalities.

DeepPaperComposer: A Simple Solution for Training Data Preparation for Parsing Research Papers

We present DeepPaperComposer, a simple solution for preparing highly accurate (100%) training data without manual labeling, to extract content from scholarly articles using convolutional neural networks.

DeepPDF: A Deep Learning Approach to Analyzing PDFs

This paper explores the feasibility of treating these PDF documents as images and believes that by using deep learning and image analysis it may be possible to create more accurate tools for extracting information from PDF documents than those that currently exist.

Robust PDF Document Conversion Using Recurrent Neural Networks

This paper presents a novel approach to document structure recovery in PDF using recurrent neural networks to process the low-level PDF data representation directly, instead of relying on a visual re-interpretation of the rendered PDF page, as has been proposed in previous literature.

Document Domain Randomization for Deep Learning Document Layout Extraction

To the best of our knowledge, this work provides the first successful application of a deep neural network that does not rely on human-curated training samples and that only exploits graphically rendered papers for real-world paper page segmentation.

Self-Supervised Learning for Visual Summary Identification in Scientific Publications

A new benchmark dataset for selecting figures to serve as visual summaries of publications based on their abstracts is created and a self-supervised learning approach is developed, based on heuristic matching of inline references to figures with figure captions, which is able to outperform the state of the art.

Visual Summary Identification From Scientific Publications via Self-Supervised Learning

This work builds a novel benchmark data set for visual summary identification from scientific publications, consisting of papers presented at conferences from several areas of computer science, and proposes a new self-supervised learning approach that learns by heuristically matching in-text references to figures with figure captions.

A Data Driven Approach for Compound Figure Separation Using Convolutional Neural Networks

A data driven approach to separate compound figures using modern deep Convolutional Neural Networks to train the separator in an end-to-end manner is proposed, using transfer learning as well as automatically synthesizing training exemplars.

FigureSeer: Parsing Result-Figures in Research Papers

This paper introduces FigureSeer, an end-to-end framework for parsing result-figures, that enables powerful search and retrieval of results in research papers and formulates a novel graph-based reasoning approach using a CNN-based similarity metric.

PDFFigures 2.0: Mining figures from research papers

This paper presents "PDFFigures 2.0", an algorithm that extracts figures, tables, and captions from documents by analyzing the structure of individual pages, detecting captions, graphical elements, and chunks of body text, and then locating figures and tables by reasoning about the empty regions within that text.
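The empty-region reasoning described for PDFFigures 2.0 can be sketched in a deliberately simplified 1-D model: treat a page as vertical intervals occupied by body text, and claim the text-free gap directly above a detected caption as the figure region. The function name and the single-axis simplification are illustrative assumptions, not the tool's actual implementation:

```python
# Hedged sketch of empty-region reasoning: given the vertical bands of a
# page occupied by text (captions included), the figure associated with a
# caption is assumed to fill the text-free gap ending at the caption's top.

def figure_region_above(caption_top, text_bands, page_height):
    """Return the empty vertical interval ending at the caption, if any.

    text_bands: non-overlapping (top, bottom) intervals occupied by text.
    """
    edges = [0] + [e for band in sorted(text_bands) for e in band] + [page_height]
    # Consecutive edge pairs starting at index 0 are the gaps between bands.
    gaps = [(edges[i], edges[i + 1]) for i in range(0, len(edges), 2)]
    for top, bottom in gaps:
        if bottom == caption_top and bottom > top:
            return (top, bottom)
    return None

# A caption band at (520, 540) with body text above and below:
band = figure_region_above(520, [(40, 300), (520, 540), (560, 760)], 800)
# -> (300, 520), the empty region claimed for the figure
```

The real algorithm works in two dimensions and handles multi-column layouts, but the core intuition is the same: figures live where body text does not.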

Learning to Extract Semantic Structure from Documents Using Multimodal Fully Convolutional Neural Networks

An end-to-end, multimodal, fully convolutional network for extracting semantic structures from document images using a unified model that classifies pixels based not only on their visual appearance, as in the traditional page segmentation task, but also on the content of underlying text.

Assembling Deep Neural Networks for Medical Compound Figure Detection

This work trains multiple convolutional neural networks, long short-term memory networks, and gated recurrent unit networks on top of pre-trained word vectors to learn textual features from captions and employ deep CNNs to learn visual features from figures.

Looking Beyond Text: Extracting Figures, Tables and Captions from Computer Science Papers

This work introduces a new dataset of 150 computer science papers along with ground truth labels for the locations of the figures, tables and captions within them and demonstrates a caption-to-figure matching component that is effective even in cases where individual captions are adjacent to multiple figures.

Automated Data Extraction from Scholarly Line Graphs

An analysis of line graphs is reported to explain the challenges of building a fully automated data extraction system and a novel curve extraction method is proposed that has an average accuracy of 82%.

Distant supervision for relation extraction without labeled data

This work investigates an alternative paradigm that does not require labeled corpora, avoiding the domain dependence of ACE-style algorithms, and allowing the use of corpora of any size.

Multi-Scale Multi-Task FCN for Semantic Page Segmentation and Table Detection

This work presents a page segmentation algorithm that incorporates state-of-the-art deep learning methods for segmenting three types of document elements: text blocks, tables, and figures. It also proposes a conditional random field (CRF) that uses features output from the semantic segmentation and contour networks to improve upon the semantic segmentation network output.

PDFMEF: A Multi-Entity Knowledge Extraction Framework for Scholarly Documents and Semantic Search

We introduce PDFMEF, a multi-entity knowledge extraction framework for scholarly documents in PDF format. The framework encapsulates open-source extraction tools.