Corpus ID: 235265639

Incorporating Visual Layout Structures for Scientific Text Classification

Zejiang Shen, Kyle Lo, Lucy Lu Wang, Bailey Kuehl, Daniel S. Weld, Doug Downey
Classifying the core textual components of a scientific paper (title, author, body text, etc.) is a critical first step in automated scientific document understanding. Previous work has shown how using elementary layout information, i.e., each token’s 2D position on the page, leads to more accurate classification. We introduce new methods for incorporating VIsual LAyout (VILA) structures, e.g., the grouping of page texts into text lines or text blocks, into language models to further improve…

SciA11y: Converting Scientific Papers to Accessible HTML
  • Lucy Lu Wang, Isabel Cachola, +7 authors Daniel Weld
  • 2021
We present SciA11y, a system that renders inaccessible scientific paper PDFs into HTML. SciA11y uses machine learning models to extract and understand the content of scientific PDFs, and reorganizes…


Kleister: A novel task for Information Extraction involving Long Documents with Complex Layout
A new task, Kleister, is introduced with two new datasets to encourage progress on deeper and more complex Information Extraction (IE), and a Pipeline method is proposed as a text-only baseline with different Named Entity Recognition architectures (Flair, BERT, RoBERTa).
Layout-Aware Text Representations Harm Clustering Documents by Type
This work finds experimentally that BERT significantly outperforms LayoutLM on this task and analyzes clusters to show where layout awareness is an asset and where it is a liability.
Robust PDF Document Conversion Using Recurrent Neural Networks
This paper presents a novel approach to document structure recovery in PDF using recurrent neural networks to process the low-level PDF data representation directly, instead of relying on a visual re-interpretation of the rendered PDF page, as has been proposed in previous literature.
PubLayNet: Largest Dataset Ever for Document Layout Analysis
The PubLayNet dataset for document layout analysis is developed by automatically matching the XML representations and the content of over 1 million PDF articles that are publicly available on PubMed Central, and it is demonstrated that deep neural networks trained on PubLayNet accurately recognize the layout of scientific articles.
LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding
This paper presents LayoutLMv2 by pre-training text, layout and image in a multi-modal framework, where new model architectures and pre-training tasks are leveraged and a spatial-aware self-attention mechanism is integrated into the Transformer architecture.
PAWLS: PDF Annotation With Labels and Structure
This paper presents PDF Annotation with Labels and Structure (PAWLS), a new annotation tool designed specifically for the PDF document format, particularly suited for mixed-mode annotation and scenarios in which annotators require extended context to annotate accurately.
Beyond 512 Tokens: Siamese Multi-depth Transformer-based Hierarchical Encoder for Long-Form Document Matching
The proposed SMITH model outperforms the previous state-of-the-art models, including hierarchical attention, multi-depth attention-based hierarchical recurrent neural network, and BERT, and is able to increase maximum input text length from 512 to 2048.
CERMINE: automatic extraction of structured metadata from scientific literature
The overall workflow architecture of CERMINE is outlined, details about the implementation of individual steps are provided, and the evaluation of the extraction workflow, carried out on a large dataset, showed good performance for most metadata types.
Extracting Scientific Figures with Distantly Supervised Neural Networks
This paper induces high-quality training labels for the task of figure extraction in a large number of scientific documents, with no human intervention, and uses this dataset to train a deep neural network for end-to-end figure detection, yielding a model that can be more easily extended to new domains compared to previous work.
Corpus Conversion Service: A Machine Learning Platform to Ingest Documents at Scale
This paper presents a modular, cloud-based platform to ingest documents at scale, called the Corpus Conversion Service (CCS), which implements a pipeline that allows users to parse and annotate documents and ultimately convert any type of PDF or bitmap document to a structured content representation format.