LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding

@article{Xu2021LayoutLMv2MP,
  title={LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding},
  author={Yang Xu and Yiheng Xu and Tengchao Lv and Lei Cui and Furu Wei and Guoxin Wang and Yijuan Lu and Dinei A. F. Flor{\^e}ncio and Cha Zhang and Wanxiang Che and Min Zhang and Lidong Zhou},
  journal={ArXiv},
  year={2021},
  volume={abs/2012.14740}
}
Pre-training of text and layout has proved effective in a variety of visually-rich document understanding tasks due to its effective model architecture and the advantage of large-scale unlabeled scanned/digital-born documents. We propose the LayoutLMv2 architecture with new pre-training tasks to model the interaction among text, layout, and image in a single multi-modal framework. Specifically, with a two-stream multi-modal Transformer encoder, LayoutLMv2 uses not only the existing masked visual…
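The interaction of the three modalities can be exercised end to end through the Hugging Face transformers port of the model. Below is a minimal fine-tuning sketch, assuming the microsoft/layoutlmv2-base-uncased checkpoint; the file path and the label count (FUNSD-style, 7 labels) are placeholders:

```python
from PIL import Image
from transformers import LayoutLMv2Processor, LayoutLMv2ForTokenClassification

# The processor bundles OCR (pytesseract by default), tokenization,
# bounding-box normalization, and image preprocessing in a single call.
# Note: the LayoutLMv2 visual backbone requires detectron2 to be installed.
processor = LayoutLMv2Processor.from_pretrained("microsoft/layoutlmv2-base-uncased")
model = LayoutLMv2ForTokenClassification.from_pretrained(
    "microsoft/layoutlmv2-base-uncased", num_labels=7  # e.g. FUNSD's label set
)

image = Image.open("document.png").convert("RGB")  # placeholder path
encoding = processor(image, return_tensors="pt")   # input_ids, bbox, image, masks
outputs = model(**encoding)                        # logits: (1, seq_len, num_labels)
```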

Citations

LayoutXLM: Multimodal Pre-training for Multilingual Visually-rich Document Understanding
TLDR
The LayoutXLM model significantly outperforms existing SOTA cross-lingual pre-trained models on the XFUN dataset and aims to bridge language barriers for visually-rich document understanding.
BROS: A Layout-Aware Pre-trained Language Model for Understanding Documents
TLDR
This paper introduces a pre-trained language model, BERT Relying On Spatiality (BROS), which effectively utilizes the information in individual text blocks and their layouts, and introduces a general-purpose parser that can be combined with BROS to extract key information even when there is no order information between text blocks.
MarkupLM: Pre-training of Text and Markup Language for Visually-rich Document Understanding
TLDR
This paper proposes MarkupLM for document understanding tasks where markup languages such as HTML/XML serve as the backbone, with text and markup information jointly pre-trained.
Skim-Attention: Learning to Focus via Document Layout
TLDR
Skim-Attention is a new attention mechanism that takes advantage of the structure of the document and its layout; it can be used off the shelf as an attention mask for any pre-trained language model, improving performance while restricting attention.
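As an illustration of the masking idea only, the sketch below builds a layout-derived attention mask from a simple spatial-proximity heuristic; the function name, the distance criterion, and the radius are assumptions, not the learned skim-attention from the paper:

```python
import torch

def layout_proximity_mask(bboxes: torch.Tensor, radius: float = 200.0) -> torch.Tensor:
    # bboxes: (seq, 4) tensor of (x0, y0, x1, y1) pixel coordinates.
    # Returns a boolean (seq, seq) mask that lets a token attend only to
    # tokens whose bounding-box centers lie within `radius` pixels.
    centers = torch.stack(
        [(bboxes[:, 0] + bboxes[:, 2]) / 2, (bboxes[:, 1] + bboxes[:, 3]) / 2],
        dim=-1,
    )
    return torch.cdist(centers, centers) <= radius
```

A mask of this shape can be passed, for example, as the attn_mask argument of torch.nn.functional.scaled_dot_product_attention.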
DocFormer: End-to-End Transformer for Document Understanding
TLDR
DocFormer uses text, vision and spatial features and combines them using a novel multi-modal self-attention layer which makes it easy for the model to correlate text to visual tokens and vice versa.
StrucTexT: Structured Text Understanding with Multi-Modal Transformers
  • Yulin Li, Yuxi Qian, +7 authors Errui Ding · ACM Multimedia · 2021
TLDR
This paper proposes a unified framework named StrucTexT, which is flexible and effective for handling both sub-tasks, and introduces a segment-token aligned encoder to deal with the entity labeling and entity linking tasks at different levels of granularity.
ViBERTgrid: A Jointly Trained Multi-Modal 2D Document Representation for Key Information Extraction from Documents
TLDR
This paper proposes a new multi-modal backbone network, built by concatenating a BERTgrid to an intermediate layer of a CNN model, to generate a more powerful grid-based document representation named ViBERTgrid, which achieves state-of-the-art performance on real-world datasets.
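The core construction can be sketched as painting each token's BERT embedding into the grid cells covered by its bounding box, then concatenating the result channel-wise with an intermediate CNN feature map; all names below are illustrative, and boxes are assumed normalized to [0, 1]:

```python
import torch

def paint_bertgrid(token_embeds: torch.Tensor, token_boxes: torch.Tensor,
                   grid_h: int, grid_w: int) -> torch.Tensor:
    # token_embeds: (N, D) token embeddings from a text encoder.
    # token_boxes:  (N, 4) normalized (x0, y0, x1, y1) boxes.
    grid = torch.zeros(token_embeds.size(1), grid_h, grid_w)
    for emb, box in zip(token_embeds, token_boxes):
        x0, y0 = int(box[0] * grid_w), int(box[1] * grid_h)
        x1 = max(int(box[2] * grid_w), x0 + 1)  # cover at least one cell
        y1 = max(int(box[3] * grid_h), y0 + 1)
        grid[:, y0:y1, x0:x1] = emb[:, None, None]
    return grid

# Channel-wise fusion with a CNN feature map of matching spatial size:
# fused = torch.cat([cnn_feat, paint_bertgrid(embeds, boxes, H, W)], dim=0)
```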
BROS: A Pre-trained Language Model Focusing on Text and Layout for Better Key Information Extraction from Documents
TLDR
A pre-trained language model, named BROS (BERT Relying On Spatiality), is proposed that encodes relative positions of texts in 2D space and learns from unlabeled documents with an area-masking strategy, showing comparable or better performance than previous methods on four KIE benchmarks without relying on visual features.
Information Extraction from Visually Rich Documents with Font Style Embeddings
TLDR
This work challenges the use of computer vision in the case where both token style and visual representation are available (i.e., native PDF documents) and demonstrates that using an embedding based on token style attributes instead of a raw visual embedding in the LayoutLM model is beneficial.
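One concrete way to read "embedding based on token style attributes": project per-token style features extracted from the PDF into the model's hidden size, standing in where LayoutLM would otherwise add a visual embedding. The feature list and dimensions below are assumptions for illustration, not the paper's exact parameterization:

```python
import torch
import torch.nn as nn

class FontStyleEmbedding(nn.Module):
    # Assumed style features per token: font size, bold flag, italic flag,
    # and an RGB color triple (6 values total). Hypothetical module.
    def __init__(self, hidden_size: int = 768, num_style_features: int = 6):
        super().__init__()
        self.proj = nn.Linear(num_style_features, hidden_size)

    def forward(self, style_feats: torch.Tensor) -> torch.Tensor:
        # style_feats: (batch, seq, num_style_features) -> (batch, seq, hidden)
        return self.proj(style_feats)
```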
Data-Efficient Information Extraction from Documents with Pre-trained Language Models
TLDR
LayoutLM, a pre-trained model recently proposed for encoding 2D documents, exhibits high sample efficiency when fine-tuned on public and real-world Information Extraction (IE) datasets, indicating valuable knowledge-transfer abilities.

References

Showing 1-10 of 47 references.
BROS: A Pre-trained Language Model
  • 2020
Understanding documents from their visual snapshots is an emerging and challenging problem that requires both advanced computer vision and NLP methods. Although the recent advance in OCR enables the…
Robust Layout-aware IE for Visually Rich Documents with Pre-trained Language Models
TLDR
This work studies the problem of information extraction from visually rich documents (VRDs) and presents a model that combines the power of large pre-trained language models and graph neural networks to efficiently encode both textual and visual information in business documents.
Deterministic Routing between Layout Abstractions for Multi-Scale Classification of Visually Rich Documents
TLDR
A spatial pyramid model is proposed to extract highly discriminative multi-scale feature descriptors from a visually rich document by leveraging the inherent hierarchy of its layout, along with a deterministic routing scheme that accelerates end-to-end inference by utilizing the spatial pyramid model.
Graph Convolution for Multimodal Information Extraction from Visually Rich Documents
TLDR
This paper introduces a graph convolution based model to combine textual and visual information presented in VRDs, which outperforms BiLSTM-CRF baselines by significant margins on two real-world datasets.
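A bare-bones version of such a layer is sketched below: token nodes carry concatenated textual and visual features, and each step aggregates neighbor features through a row-normalized adjacency matrix. The adjacency construction (e.g. spatial proximity between text segments) and the dimensions are illustrative assumptions:

```python
import torch
import torch.nn as nn

class SimpleGraphConv(nn.Module):
    # One graph-convolution step: aggregate neighbor features with a
    # row-normalized adjacency matrix, then apply a linear projection.
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, node_feats: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # node_feats: (N, in_dim); adj: (N, N), self-loops included.
        norm_adj = adj / adj.sum(dim=-1, keepdim=True).clamp(min=1e-6)
        return torch.relu(self.linear(norm_adj @ node_feats))

# Node features might concatenate text and visual cues per token:
# node_feats = torch.cat([text_emb, visual_emb], dim=-1)
```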
Fast CNN-Based Document Layout Analysis
TLDR
This paper takes advantage of the inherently one-dimensional pattern observed in text and table blocks to reduce the analysis of two-dimensional document images to 1D signatures, significantly improving overall performance.
Learning to Extract Semantic Structure from Documents Using Multimodal Fully Convolutional Neural Networks
TLDR
An end-to-end, multimodal, fully convolutional network for extracting semantic structures from document images using a unified model that classifies pixels based not only on their visual appearance, as in the traditional page segmentation task, but also on the content of underlying text.
Visual Detection with Context for Document Layout Analysis
TLDR
A work-in-progress method to visually segment key regions of scientific articles using an object detection technique augmented with contextual features is presented, along with a novel dataset of region-labeled articles and ongoing work on further improvements.
ZeroShotCeres: Zero-Shot Relation Extraction from Semi-Structured Webpages
TLDR
This work proposes a solution for "zero-shot" open-domain relation extraction from webpages with a previously unseen template, including from websites with little overlap with existing sources of knowledge for distant supervision and websites in entirely new subject verticals.
ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks
We present ViLBERT (short for Vision-and-Language BERT), a model for learning task-agnostic joint representations of image content and natural language. We extend the popular BERT architecture to a…
Kleister: A novel task for Information Extraction involving Long Documents with Complex Layout
TLDR
A new task (named Kleister) is introduced with two new datasets to encourage progress on deeper and more complex Information Extraction (IE), and a Pipeline method is proposed as a text-only baseline with different Named Entity Recognition architectures (Flair, BERT, RoBERTa).