Capturing Logical Structure of Visually Structured Documents with Multimodal Transition Parser

@article{Koreeda2021CapturingLS,
  title={Capturing Logical Structure of Visually Structured Documents with Multimodal Transition Parser},
  author={Yuta Koreeda and Christopher D. Manning},
  journal={ArXiv},
  year={2021},
  volume={abs/2105.00150}
}
While many NLP pipelines assume raw, clean texts, many texts we encounter in the wild, including a vast majority of legal documents, are not so clean, with many of them being visually structured documents (VSDs) such as PDFs. Conventional preprocessing tools for VSDs mainly focused on word segmentation and coarse layout analysis, whereas fine-grained logical structure analysis (such as identifying paragraph boundaries and their hierarchies) of VSDs is underexplored. To that end, we proposed to… 

ContractNLI: A Dataset for Document-level Natural Language Inference for Contracts
TLDR
This work proposes document-level natural language inference for contracts, a novel, real-world application of NLI that addresses problems of reviewing contracts, and introduces a strong baseline that models evidence identification as multi-label classification over spans instead of trying to predict start and end tokens.

References

SHOWING 1-10 OF 22 REFERENCES
A Benchmark and Evaluation for Text Extraction from PDF
TLDR
This paper shows how to construct a high-quality benchmark of principally arbitrary size from parallel TeX and PDF data, and establishes a set of criteria for a clean and independent assessment of the semantic abilities of a given extraction tool.
Language Models are Unsupervised Multitask Learners
TLDR
It is demonstrated that language models begin to learn these tasks without any explicit supervision when trained on a new dataset of millions of webpages called WebText, suggesting a promising path towards building language processing systems which learn to perform tasks from their naturally occurring demonstrations.
Visual Detection with Context for Document Layout Analysis
TLDR
A work-in-progress method to visually segment key regions of scientific articles using an object detection technique augmented with contextual features is presented, along with a novel dataset of region-labeled articles; ongoing work on further improvements is discussed.
Deep Biaffine Attention for Neural Dependency Parsing
TLDR
This paper uses a larger but more thoroughly regularized parser than other recent BiLSTM-based approaches, with biaffine classifiers to predict arcs and labels, and shows which hyperparameter choices had a significant effect on parsing accuracy, allowing it to achieve large gains over other graph-based approaches.
Estimating Legal Document Structure by Considering Style Information and Table of Contents
TLDR
This paper presents a preprocessing method to estimate document structure from documents without a common structure; it follows a rule-based approach and consists of three algorithms based on style information, such as bold font, which summarizes the document’s structure.
AMR Parsing as Sequence-to-Graph Transduction
TLDR
This work proposes an attention-based model that treats AMR parsing as sequence-to-graph transduction, and it can be effectively trained with limited amounts of labeled AMR data.
Automatic Paragraph Identification: A Study across Languages and Domains
TLDR
A machine learning approach that exploits textual and discourse cues is proposed; it achieves an accuracy that is significantly higher than the best baseline and comes to within 6% of human performance.
FinDSE@FinTOC-2019 Shared Task
TLDR
A supervised learning approach making use of linguistic, semantic and morphological features to classify a text block as title or non-title is proposed.
Sentence Boundary Detection and the Problem with the U.S.
Sentence Boundary Detection is widely used but often with outdated tools. We discuss what makes it difficult, which features are relevant, and present a fully statistical system, now publicly
PDFdigest: an Adaptable Layout-Aware PDF-to-XML Textual Content Extractor for Scientific Articles
Paper presented at the Language Resources and Evaluation Conference (LREC) 2018, held May 7 to 12, 2018, in Miyazaki, Japan.