PAWLS: PDF Annotation With Labels and Structure

@article{Neumann2021PAWLSPA,
  title={PAWLS: PDF Annotation With Labels and Structure},
  author={Mark Neumann and Zejiang Shen and Sam Skjonsberg},
  journal={ArXiv},
  year={2021},
  volume={abs/2101.10281}
}
Adobe’s Portable Document Format (PDF) is a popular way of distributing view-only documents with a rich visual markup. This presents a challenge to NLP practitioners who wish to use the information contained within PDF documents for training models or data analysis, because annotating these documents is difficult. In this paper, we present PDF Annotation with Labels and Structure (PAWLS), a new annotation tool designed specifically for the PDF document format. PAWLS is particularly suited for… 

VILA: Improving Structured Content Extraction from Scientific PDFs Using Visual Layout Groups
TLDR
New methods that explicitly model VIsual LAyout (VILA) groups, that is, text lines or text blocks, are introduced to further improve performance, and it is shown that simply inserting special tokens denoting layout group boundaries into model inputs can lead to a 1.9% Macro F1 improvement in token classification.
Incorporating Visual Layout Structures for Scientific Text Classification
TLDR
This work introduces new methods for incorporating VIsual LAyout (VILA) structures, e.g., the grouping of page texts into text lines or text blocks, into language models to further improve performance and designs a hierarchical model, H-VILA, that encodes the text based on layout structures.
Development and Evaluation of a Tool for Assisting Content Creators in Making PDF Files More Accessible
TLDR
The approaches taken in Ally improve the ability to create accessible PDFs efficiently and accurately for the four important aspects studied, but future work will need to incorporate additional functionality, related to remediating alt text, forms, and other aspects of PDF accessibility.
Infrastructure for Rapid Open Knowledge Network Development
TLDR
A National Science Foundation Convergence Accelerator project is described that builds a set of Knowledge Network Programming Infrastructure systems to address the frustratingly slow process of building, using, and scaling large knowledge networks.

References

PDFAnno: a Web-based Linguistic Annotation Tool for PDF Documents
We present PDFAnno, a web-based linguistic annotation tool for PDF documents. PDF has become a widespread standard for various types of publications; however, current tools for linguistic annotation
SideNoter: Scholarly Paper Browsing System based on PDF Restructuring and Text Annotation
TLDR
This system provides ways to extract natural language sentences from PDF files together with their logical structures, and also to map arbitrary textual spans to their corresponding regions on page images, and is planned to be made widely available to NLP researchers.
SLATE: A Super-Lightweight Annotation Tool for Experts
TLDR
SLATE is a new annotation tool that is designed to fill the niche of a lightweight interface for users with a terminal-based workflow, and has already been used to annotate two corpora.
Automatic Annotation Suggestions and Custom Annotation Layers in WebAnno
TLDR
This paper extends WebAnno, an open-source web-based annotation tool, tightly integrates a generic machine learning component for automatic annotation suggestions of span annotations, and shows that automatic annotation suggestions, combined with the split-pane UI concept, significantly reduce annotation time.
PubTator: a PubMed-like interactive curation system for document triage and literature curation
TLDR
PubTator was developed to have a look-and-feel similar to PubMed, thus minimizing the learning efforts required for new users, and allows its users to do semantic search besides the traditional keyword based search, a novel feature not available in PubMed.
SciREX: A Challenge Dataset for Document-Level Information Extraction
TLDR
SciREX, a document-level IE dataset, is introduced that encompasses multiple IE tasks, including salient entity identification and document-level N-ary relation identification from scientific articles, and a neural model is developed as a strong baseline that extends previous state-of-the-art IE models to document-level IE.
PubLayNet: Largest Dataset Ever for Document Layout Analysis
TLDR
The PubLayNet dataset for document layout analysis is developed by automatically matching the XML representations and the content of over 1 million PDF articles that are publicly available on PubMed Central, and it is demonstrated that deep neural networks trained on PubLayNet accurately recognize the layout of scientific articles.
TALEN: Tool for Annotation of Low-resource ENtities
TLDR
A small user study is conducted to compare against a popular annotation tool, showing that TALEN achieves higher precision and recall against ground-truth annotations, and that users strongly prefer it over the alternative.
brat: a Web-based Tool for NLP-Assisted Text Annotation
TLDR
The brat rapid annotation tool (BRAT), an intuitive web-based tool for text annotation supported by Natural Language Processing (NLP) technology, is introduced, along with an evaluation of annotation assisted by semantic class disambiguation on a multicategory entity mention annotation task, showing a 15% decrease in total annotation time.
Knowtator: A Protégé plug-in for annotated corpus construction
A general-purpose text annotation tool called Knowtator is introduced. Knowtator facilitates the manual creation of annotated corpora that can be used for evaluating or training a variety of natural