Spatial Dependency Parsing for Semi-Structured Document Information Extraction

@inproceedings{Hwang2021SpatialDP,
  title={Spatial Dependency Parsing for Semi-Structured Document Information Extraction},
  author={Wonseok Hwang and Jinyeong Yim and Seunghyun Park and Sohee Yang and Minjoon Seo},
  booktitle={Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021},
  year={2021}
}
Information Extraction (IE) for semi-structured document images is often approached as a sequence tagging problem: each recognized input token is classified into one of the IOB (Inside, Outside, Beginning) categories. However, such a problem setup has two inherent limitations: (1) it cannot easily handle complex spatial relationships, and (2) it is not suitable for highly structured information, both of which are nevertheless frequently observed in real-world document images. To tackle these…
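As a concrete illustration of the IOB/BIO sequence-tagging formulation described above, here is a minimal sketch (not the paper's code; the field names and receipt tokens are hypothetical) that decodes per-token BIO tags into flat fields. Note the limitation the abstract points at: the flat tag sequence alone cannot say which price belongs to which menu item, so grouping fields into line items requires spatial structure beyond the tags.

```python
def decode_bio(tokens, tags):
    """Group OCR tokens into (field, text) entities from BIO tags."""
    entities, current = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag starts a new entity, closing any open one.
            if current:
                entities.append(current)
            current = (tag[2:], [token])
        elif tag.startswith("I-") and current and tag[2:] == current[0]:
            # An I- tag continues the open entity of the same field.
            current[1].append(token)
        else:
            # "O", or an I- tag that does not continue the open entity.
            if current:
                entities.append(current)
            current = None
    if current:
        entities.append(current)
    return [(label, " ".join(words)) for label, words in entities]

# Hypothetical receipt tokens in left-to-right reading order:
tokens = ["Latte", "2", "9,000", "Mocha", "1", "5,500"]
tags   = ["B-menu.name", "B-menu.count", "B-menu.price",
          "B-menu.name", "B-menu.count", "B-menu.price"]
fields = decode_bio(tokens, tags)
# fields is a flat list of (field, value) pairs; nothing in the tag
# sequence itself links "9,000" to "Latte" rather than to "Mocha".
```

The decoded output is a flat field list; recovering the nested structure (which count and price belong to which item) is exactly the kind of spatial/structural relationship the paper's dependency-parsing formulation targets.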

Citations

StrucTexT: Structured Text Understanding with Multi-Modal Transformers
Yulin Li, Yuxi Qian, +7 authors, Errui Ding · Computer Science · ArXiv · 2021
TLDR
This paper proposes a unified framework named StrucTexT, which is flexible and effective for handling both sub-tasks, and introduces a segment-token aligned encoder to deal with the entity labeling and entity linking tasks at different levels of granularity.
DeepCPCFG: Deep Learning and Context Free Grammars for End-to-End Information Extraction
TLDR
This work proposes Deep Conditional Probabilistic Context Free Grammars (DeepCPCFG) to parse two-dimensional complex documents and uses Recursive Neural Networks to create an end-to-end system for finding the most probable parse that represents the structured information to be extracted.
BROS: A Pre-trained Language Model Focusing on Text and Layout for Better Key Information Extraction from Documents
TLDR
A pre-trained language model, named BROS (BERT Relying On Spatiality), is proposed that encodes relative positions of texts in 2D space and learns from unlabeled documents with an area-masking strategy, and shows comparable or better performance compared to previous methods on four KIE benchmarks without relying on visual features.
A Span Extraction Approach for Information Extraction on Visually-Rich Documents
TLDR
A new query-based IE model that employs span extraction instead of using the common sequence labeling approach is introduced and a new training task focusing on modelling the relationships among semantic entities within a document is proposed.
Cost-effective End-to-end Information Extraction for Semi-structured Document Images
TLDR
By carefully formulating document IE as a sequence generation task, it is shown that a single end-to-end IE system can be built and still achieve competent performance.
DocFormer: End-to-End Transformer for Document Understanding
TLDR
DocFormer uses text, vision and spatial features and combines them using a novel multi-modal self-attention layer which makes it easy for the model to correlate text to visual tokens and vice versa.
Going Full-TILT Boogie on Document Understanding with Text-Image-Layout Transformer
We address the challenging problem of Natural Language Comprehension beyond plain-text documents by introducing the TILT neural network architecture, which simultaneously learns layout information, …
LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding
TLDR
This paper presents LayoutLMv2 by pre-training text, layout and image in a multi-modal framework, where new model architectures and pre-training tasks are leveraged and a spatial-aware self-attention mechanism is integrated into the Transformer architecture.
ViBERTgrid: A Jointly Trained Multi-Modal 2D Document Representation for Key Information Extraction from Documents
TLDR
This paper proposes a new multi-modal backbone network by concatenating a BERTgrid to an intermediate layer of a CNN model, to generate a more powerful grid-based document representation, named ViBERTgrid, which has achieved state-of-the-art performance on real-world datasets.
A Survey of Deep Learning Approaches for OCR and Document Understanding
TLDR
Different techniques for understanding documents written in English are reviewed, and methodologies present in the literature are consolidated to act as a jumping-off point for researchers exploring this area.

References

Showing 1–10 of 37 references
FUNSD: A Dataset for Form Understanding in Noisy Scanned Documents
TLDR
This work presents a new dataset for form understanding in noisy scanned documents (FUNSD) that aims at extracting and structuring the textual content of forms, and is the first publicly available dataset with comprehensive annotations to address the form understanding (FoUn) task.
Attention Is All You Need
TLDR
A new simple network architecture, the Transformer, based solely on attention mechanisms and dispensing with recurrence and convolutions entirely, is proposed, which generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.
CORD: A Consolidated Receipt Dataset for Post-OCR Parsing
TLDR
A consolidated dataset for receipt parsing is published, consisting of thousands of Indonesian receipts with images and box/text annotations for OCR, and multi-level semantic labels for parsing.
Post-OCR parsing: building simple and robust parser via BIO tagging
TLDR
This work presents a post-OCR tagging based parser (POT), a simple and robust parser that can parse visually embedded texts by BIO-tagging the output of the optical character recognition (OCR) task.
Cost-effective End-to-end Information Extraction for Semi-structured Document Images
TLDR
By carefully formulating document IE as a sequence generation task, it is shown that a single end-to-end IE system can be built and still achieve competent performance.
Going Full-TILT Boogie on Document Understanding with Text-Image-Layout Transformer
We address the challenging problem of Natural Language Comprehension beyond plain-text documents by introducing the TILT neural network architecture, which simultaneously learns layout information, …
LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding
TLDR
This paper presents LayoutLMv2 by pre-training text, layout and image in a multi-modal framework, where new model architectures and pre-training tasks are leveraged and a spatial-aware self-attention mechanism is integrated into the Transformer architecture.
PICK: Processing Key Information Extraction from Documents using Improved Graph Learning-Convolutional Networks
TLDR
PICK is introduced, a framework that is effective and robust in handling complex document layouts for KIE by combining graph learning with graph convolution operations, yielding a richer semantic representation containing the textual and visual features and global layout without ambiguity.
Towards Robust Visual Information Extraction in Real World: New Dataset and Novel Solution
TLDR
This paper proposes a robust visual information extraction system (VIES) towards real-world scenarios, which is a unified end-to-end trainable framework for simultaneous text detection, recognition and information extraction, taking a single document image as input and outputting the structured information.
BROS: A Pre-trained Language Model
2020
Understanding documents from their visual snapshots is an emerging and challenging problem that requires both advanced computer vision and NLP methods. Although the recent advance in OCR enables the …