LayoutReader: Pre-training of Text and Layout for Reading Order Detection

Zilong Wang, Yiheng Xu, Lei Cui, Jingbo Shang, Furu Wei
Reading order detection is the cornerstone to understanding visually-rich documents (e.g., receipts and forms). Unfortunately, no existing work took advantage of advanced deep learning models because it is too laborious to annotate a large enough dataset. We observe that the reading order of WORD documents is embedded in their XML metadata; meanwhile, it is easy to convert WORD documents to PDFs or images. Therefore, in an automated manner, we construct ReadingBank, a benchmark dataset that… 

BROS: A Pre-trained Language Model Focusing on Text and Layout for Better Key Information Extraction from Documents

A pre-trained language model named BROS (BERT Relying On Spatiality) is proposed; it encodes the relative positions of texts in 2D space, learns from unlabeled documents with an area-masking strategy, and shows comparable or better performance than previous methods on four KIE benchmarks without relying on visual features.

XYLayoutLM: Towards Layout-Aware Multimodal Networks For Visually-Rich Document Understanding

A robust layout-aware multimodal network named XYLayoutLM is proposed to capture and leverage rich layout information from proper reading orders produced by the Augmented XY Cut to achieve competitive results on document understanding tasks.

Towards Few-shot Entity Recognition in Document Images: A Label-aware Sequence-to-Sequence Framework

This paper builds an entity recognition model requiring only a few shots of annotated document images and develops a novel label-aware seq2seq framework, LASER, which refines the label semantics by updating the label surface name representations and also strengthens the label-region correlation.

Document AI: Benchmarks, Models and Applications

Early-stage heuristic rule-based document analysis, statistical machine learning algorithms, and deep learning approaches, especially pre-training methods, are introduced, and future directions for Document AI research are examined.

DavarOCR: A Toolbox for OCR and Multi-Modal Document Understanding

DavarOCR is an open-source toolbox for OCR and document understanding tasks that offers comparatively complete support for the sub-tasks of state-of-the-art document understanding.

A comparative Study of Handwritten Devanagari Script Character Recognition Techniques

This paper presents a comparative study of four classifiers and two feature extraction techniques: Multi-Layer Perceptron, K-Nearest Neighbor, Support Vector Machine, and Random Forest are used as classifiers, whereas Convolutional Neural Network and Histogram of Oriented Gradients are used as feature extraction techniques.

Towards Optimizing OCR for Accessibility

Visual cues such as structure, emphasis, and icons play an important role in efficient information foraging by sighted individuals and make for a pleasurable reading experience. Blind, low-vision and

Relational Representation Learning in Visually-Rich Documents

A novel contrastive learning task named Relational Consistency Modeling (RCM) is proposed; it harnesses the fact that existing relations should be consistent across differently augmented positive views, and it provides relational representations that are more compatible with the needs of downstream tasks, even without any knowledge of the exact definition of relation.



LayoutLM: Pre-training of Text and Layout for Document Image Understanding

LayoutLM is proposed to jointly model interactions between text and layout information across scanned document images, which benefits many real-world document image understanding tasks such as information extraction from scanned documents.

Machine Learning for Reading Order Detection in Document Image Understanding

The problem of detecting the reading order relationship between components of a logical structure is investigated, typically denoted as document layout analysis, which involves several steps including preprocessing, page decomposition, classification of segments according to content type and hierarchical organization on the basis of perceptual meaning.

An End-to-End OCR Text Re-organization Sequence Learning for Rich-Text Detail Image Comprehension

A Graph Neural Network is created with an attention map to encode the text blocks with visual layout features, with which an attention-based sequence decoder inspired by the Pointer Network and a Sinkhorn global optimization will reorder the OCR text into a proper sequence.

PubLayNet: Largest Dataset Ever for Document Layout Analysis

The PubLayNet dataset for document layout analysis is developed by automatically matching the XML representations and the content of over 1 million PDF articles that are publicly available on PubMed Central, and deep neural networks trained on PubLayNet are shown to accurately recognize the layout of scientific articles.

The Significance of Reading Order in Document Recognition and Its Evaluation

A novel evaluation approach that responds to the evaluation of reading order results generated by layout analysis methods by incorporating region correspondence analysis is proposed and a sophisticated reading order representation scheme is presented and used by the system.

A Data Mining Approach to Reading Order Detection

This paper investigates the problem of detecting the reading order of layout components by resorting to a data mining approach, which acquires domain-specific knowledge from a set of training examples and induces a probabilistic classifier, based on the Bayesian framework, that is used for reconstructing either single or multiple chains of layout components.

TableBank: Table Benchmark for Image-based Table Detection and Recognition

This work presents TableBank, a new image-based table detection and recognition dataset built with novel weak supervision from Word and LaTeX documents on the internet; it contains 417K high-quality labeled tables, and several strong baselines are built on it using state-of-the-art deep neural network models.

Extracting Scientific Figures with Distantly Supervised Neural Networks

This paper induces high-quality training labels for the task of figure extraction in a large number of scientific documents, with no human intervention, and uses this dataset to train a deep neural network for end-to-end figure detection, yielding a model that can be more easily extended to new domains compared to previous work.

Unified Language Model Pre-training for Natural Language Understanding and Generation

A new Unified pre-trained Language Model (UniLM) is proposed that can be fine-tuned for both natural language understanding and generation tasks; it compares favorably with BERT on the GLUE benchmark and on the SQuAD 2.0 and CoQA question answering tasks.

Abstract argumentation for reading order detection

Experimental results show that the automatic strategy for identifying the correct reading order of a document page's components based on abstract argumentation is effective in more complex cases, and requires less background knowledge, than previous solutions that have been proposed in the literature.