Extracting Scientific Figures with Distantly Supervised Neural Networks

@article{Siegel2018ExtractingSF,
  title={Extracting Scientific Figures with Distantly Supervised Neural Networks},
  author={Noah Siegel and Nicholas Lourie and R. Power and Waleed Ammar},
  journal={Proceedings of the 18th ACM/IEEE on Joint Conference on Digital Libraries},
  year={2018}
}
Non-textual components such as charts, diagrams and tables provide key information in many scientific documents, but the lack of large labeled datasets has impeded the development of data-driven methods for scientific figure extraction. [...] We share the resulting dataset of over 5.5 million induced labels---4,000 times larger than the previous largest figure extraction dataset---with an average precision of 96.8%, to enable the development of modern data-driven methods for this task.
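As a rough illustration of how figure labels might be induced without manual annotation, the sketch below diffs two renderings of the same page, one recompiled so that figure regions are filled with a solid color, and takes the bounding box of the changed pixels as an induced label. The renderer choice (pdf2image), the function name induce_figure_boxes, and the one-box-per-page simplification are assumptions for illustration, not the paper's actual pipeline.

```python
import numpy as np
from pdf2image import convert_from_path  # illustrative choice of PDF renderer

def induce_figure_boxes(original_pdf, highlighted_pdf, page=0, dpi=100):
    """Bounding box (x0, y0, x1, y1) of pixels that differ between the original
    page rendering and one whose figure regions were filled with a solid color."""
    orig = np.asarray(convert_from_path(original_pdf, dpi=dpi)[page].convert("RGB"))
    high = np.asarray(convert_from_path(highlighted_pdf, dpi=dpi)[page].convert("RGB"))
    changed = np.any(orig != high, axis=-1)   # True wherever the renderings disagree
    if not changed.any():
        return None                           # no figure detected on this page
    ys, xs = np.where(changed)
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())
```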
MexPub: Deep Transfer Learning for Metadata Extraction from German Publications
TLDR
This paper presents a method that extracts metadata from PDF documents with different layouts and styles by viewing the document as an image, achieving an average accuracy of around 90%, which validates its capability to accurately extract metadata from a variety of PDF documents with challenging templates.
Document Domain Randomization for Deep Learning Document Layout Extraction
TLDR
This work presents document domain randomization (DDR), the first successful transfer of CNNs trained only on graphically rendered pseudo-paper pages to real-world document segmentation, and shows that high-fidelity semantic information is not necessary to label semantic classes, but that style mismatch between training and test data can lower model accuracy.
DeepPaperComposer: A Simple Solution for Training Data Preparation for Parsing Research Papers
We present DeepPaperComposer, a simple solution for preparing highly accurate (100%) training data without manual labeling to extract content from scholarly articles using convolutional neural networks.
DeepPDF: A Deep Learning Approach to Analyzing PDFs
Scientific publications contain a plethora of important information, not only for researchers but also for their managers and institutions. Many researchers try to collect and extract this [...]
Robust PDF Document Conversion Using Recurrent Neural Networks
TLDR
This paper presents a novel approach to document structure recovery in PDF using recurrent neural networks to process the low-level PDF data representation directly, instead of relying on a visual re-interpretation of the rendered PDF page, as has been proposed in previous literature.
A Survey of Graphical Page Object Detection with Deep Neural Networks
TLDR
This work outlines and summarizes deep learning approaches for detecting graphical page objects in document images, discussing the most relevant deep learning-based approaches and the state of the art in graphical page object detection.
Self-Supervised Learning for Visual Summary Identification in Scientific Publications
TLDR
A new benchmark dataset for selecting figures to serve as visual summaries of publications based on their abstracts is created, and a self-supervised learning approach is developed, based on heuristic matching of inline references to figures with figure captions, which is able to outperform the state of the art.
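A minimal sketch of the heuristic-matching idea, assuming sentences and numbered figure captions have already been extracted from a paper; the regex, data structures, and function name are assumptions for illustration rather than the authors' implementation.

```python
import re

# Inline references such as "Figure 3" or "Fig. 3" are paired with caption 3.
FIG_REF = re.compile(r"\b(?:Fig\.|Figure)\s*(\d+)", re.IGNORECASE)

def match_inline_references(sentences, captions):
    """sentences: list of str. captions: dict mapping figure number (int) to caption text."""
    pairs = []
    for sent in sentences:
        for num in FIG_REF.findall(sent):
            if int(num) in captions:
                pairs.append((sent, captions[int(num)]))  # (referencing sentence, caption)
    return pairs
```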
Visual Summary Identification From Scientific Publications via Self-Supervised Learning
The exponential growth of scientific literature yields the need to support users in both effectively and efficiently analyzing and understanding the body of research work. This exploratory process [...]
Deep Learning-based Extraction of Algorithmic Metadata in Full-Text Scholarly Documents
TLDR
A set of enhancements to the previously proposed algorithm search engine AlgorithmSeer is presented, proposing a set of methods to automatically identify and extract algorithmic pseudo-codes and the sentences that convey related algorithmic metadata using a set of machine-learning techniques.
ScanBank: A Benchmark Dataset for Figure Extraction from Scanned Electronic Theses and Dissertations
TLDR
ScanBank is a new dataset containing 10 thousand scanned page images, manually labeled for the presence of the 3.3 thousand figures or tables found therein, and is used to train a YOLOv5-based deep neural network model to accurately extract figures and tables from scanned ETDs.

References

Showing 1-10 of 30 references
A Data Driven Approach for Compound Figure Separation Using Convolutional Neural Networks
TLDR
A data-driven approach to separating compound figures is proposed, using modern deep convolutional neural networks to train the separator in an end-to-end manner, with transfer learning as well as automatically synthesized training exemplars.
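A toy sketch of how training exemplars for compound-figure separation might be synthesized automatically, assuming a simple horizontal-strip layout; panel sizes, spacing, and the helper name are illustrative choices, not the paper's procedure.

```python
from PIL import Image

def synthesize_compound_figure(panel_paths, gap=10, side=256):
    """Paste standalone panels into one horizontal strip; the separator
    x-coordinates come out for free and can serve as training labels."""
    panels = [Image.open(p).convert("RGB").resize((side, side)) for p in panel_paths]
    width = len(panels) * side + (len(panels) - 1) * gap
    canvas = Image.new("RGB", (width, side), "white")
    boundaries = []          # x-coordinates of the gaps between panels
    x = 0
    for i, panel in enumerate(panels):
        canvas.paste(panel, (x, 0))
        x += side
        if i < len(panels) - 1:
            boundaries.append(x + gap // 2)
            x += gap
    return canvas, boundaries
```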
FigureSeer: Parsing Result-Figures in Research Papers
TLDR
This paper introduces FigureSeer, an end-to-end framework for parsing result-figures that enables powerful search and retrieval of results in research papers, and formulates a novel graph-based reasoning approach using a CNN-based similarity metric.
PDFFigures 2.0: Mining figures from research papers
TLDR
An algorithm, "PDFFigures 2.0", that extracts figures, tables, and captions from documents by analyzing the structure of individual pages, detecting captions, graphical elements, and chunks of body text, and then locating figures and tables by reasoning about the empty regions within that text.
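A toy sketch of the empty-region idea, assuming body-text bounding boxes (in integer page coordinates) are already known; the coarse grid, its resolution, and the helper name are assumptions, not PDFFigures 2.0's actual algorithm.

```python
import numpy as np

def empty_region_mask(page_width, page_height, text_boxes, cell=10):
    """Mark coarse grid cells covered by body-text boxes; the remaining True
    cells are empty page regions where figures and tables may sit."""
    grid = np.zeros((page_height // cell, page_width // cell), dtype=bool)
    for x0, y0, x1, y1 in text_boxes:  # boxes as (x0, y0, x1, y1)
        grid[y0 // cell:(y1 // cell) + 1, x0 // cell:(x1 // cell) + 1] = True
    return ~grid  # True where the page is empty of body text
```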
Learning to Extract Semantic Structure from Documents Using Multimodal Fully Convolutional Neural Networks
TLDR
An end-to-end, multimodal, fully convolutional network for extracting semantic structures from document images using a unified model that classifies pixels based not only on their visual appearance, as in the traditional page segmentation task, but also on the content of underlying text.
Assembling Deep Neural Networks for Medical Compound Figure Detection
TLDR
This work trains multiple convolutional neural networks, long short-term memory networks, and gated recurrent unit networks on top of pre-trained word vectors to learn textual features from captions, and employs deep CNNs to learn visual features from figures.
Looking Beyond Text: Extracting Figures, Tables and Captions from Computer Science Papers
TLDR
This work introduces a new dataset of 150 computer science papers along with ground-truth labels for the locations of the figures, tables and captions within them, and demonstrates a caption-to-figure matching component that is effective even in cases where individual captions are adjacent to multiple figures.
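A simplified caption-to-figure assignment, shown only to illustrate the matching problem; the nearest-unassigned-figure rule and the function name below are assumptions, not the paper's method.

```python
def match_captions_to_figures(captions, figures):
    """captions, figures: lists of (x, y) page-center coordinates.
    Each caption greedily claims the nearest not-yet-assigned figure region."""
    assignments, used = {}, set()
    for ci, (cx, cy) in enumerate(captions):
        best, best_dist = None, float("inf")
        for fi, (fx, fy) in enumerate(figures):
            if fi in used:
                continue
            dist = (cx - fx) ** 2 + (cy - fy) ** 2  # squared distance is enough for ranking
            if dist < best_dist:
                best, best_dist = fi, dist
        if best is not None:
            assignments[ci] = best
            used.add(best)
    return assignments
```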
Automated Data Extraction from Scholarly Line Graphs
Line graphs are ubiquitous in scholarly papers. They are usually generated from a data table and often used to compare the performance of various methods. The data in these figures cannot be accessed. [...]
Distant supervision for relation extraction without labeled data
TLDR
This work investigates an alternative paradigm that does not require labeled corpora, avoiding the domain dependence of ACE-style algorithms, and allowing the use of corpora of any size.
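The core idea of distant supervision can be sketched in a few lines: any sentence that mentions both entities of a known knowledge-base fact is (noisily) labeled with that fact's relation. The data structures and string-containment matching below are simplifying assumptions for illustration.

```python
def distantly_label(sentences, kb_facts):
    """sentences: list of str. kb_facts: list of (entity1, relation, entity2).
    Any sentence containing both entities of a fact is labeled with its relation."""
    labeled = []
    for sent in sentences:
        for e1, rel, e2 in kb_facts:
            if e1 in sent and e2 in sent:
                labeled.append((sent, e1, e2, rel))  # noisy training example
    return labeled

# e.g. the fact ("Steve Jobs", "founder_of", "Apple") labels the sentence
# "Steve Jobs co-founded Apple in 1976." as a founder_of training example.
```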
Multi-Scale Multi-Task FCN for Semantic Page Segmentation and Table Detection
TLDR
This work presents a page segmentation algorithm that incorporates state-of-the-art deep learning methods for segmenting three types of document elements: text blocks, tables, and figures, and proposes a conditional random field (CRF) that uses features output from the semantic segmentation and contour networks to improve upon the semantic segmentation network output.
PDFMEF: A Multi-Entity Knowledge Extraction Framework for Scholarly Documents and Semantic Search
We introduce PDFMEF, a multi-entity knowledge extraction framework for scholarly documents in the PDF format. It is implemented with a framework that encapsulates open-source extraction tools.