LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking

@article{Huang2022LayoutLMv3PF,
  title={LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking},
  author={Yupan Huang and Tengchao Lv and Lei Cui and Yutong Lu and Furu Wei},
  journal={ArXiv},
  year={2022},
  volume={abs/2204.08387}
}
Self-supervised pre-training techniques have achieved remarkable progress in Document AI. Most multimodal pre-trained models use a masked language modeling objective to learn bidirectional representations on the text modality, but they differ in pre-training objectives for the image modality. This discrepancy adds difficulty to multimodal representation learning. In this paper, we propose LayoutLMv3 to pre-train multimodal Transformers for Document AI with unified text and image masking… 
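
As a rough picture of what a unified masking objective looks like, the sketch below combines BERT-style masked language modeling over word tokens with masked image modeling over discrete image-patch tokens, applying the same mask-and-predict recipe to both modalities. It is a minimal illustration under assumed shapes and mask ids, not the paper's exact pre-training configuration (the encoder, visual tokenizer, and loss weighting are placeholders).

    import torch.nn.functional as F

    def unified_masking_loss(encoder, text_ids, image_tokens, text_mask, image_mask):
        # Replace masked positions with placeholder [MASK] ids (103 for text,
        # 0 for image patches here); both choices are illustrative assumptions.
        masked_text = text_ids.masked_fill(text_mask, 103)
        masked_image = image_tokens.masked_fill(image_mask, 0)

        # `encoder` stands in for any multimodal Transformer that returns
        # per-position logits for the text and image streams.
        text_logits, image_logits = encoder(masked_text, masked_image)

        # Cross-entropy only at masked positions: MLM for text, MIM for image tokens.
        mlm_loss = F.cross_entropy(text_logits[text_mask], text_ids[text_mask])
        mim_loss = F.cross_entropy(image_logits[image_mask], image_tokens[image_mask])
        return mlm_loss + mim_loss
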

Citations

Knowing Where and What: Unified Word Block Pretraining for Document Understanding
TLDR
This paper focuses on embedding learning for word blocks containing text and layout information, and proposes UTel, a language model with Unified TExt and Layout pre-training that outperforms previous methods on various downstream tasks while requiring no image modality.
DiT: Self-supervised Pre-training for Document Image Transformer
TLDR
This paper proposes DiT, a self-supervised pre-trained Document Image Transformer model that uses large-scale unlabeled text images for Document AI tasks, which is essential since no supervised counterparts exist due to the lack of human-labeled document images.
VLCDoC: Vision-Language Contrastive Pre-Training Model for Cross-Modal Document Classification
TLDR
The proposed learning objective is devised between intra- and inter-modality alignment tasks, where the similarity distribution per task is computed by contracting positive sample pairs while simultaneously contrasting negative ones in the common feature representation space.
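
As a generic illustration of this kind of objective (not VLCDoC's exact intra-/inter-modality formulation), the sketch below computes a symmetric InfoNCE-style loss in which matching vision-language pairs are pulled together and all mismatched pairs in the batch serve as negatives; the temperature value is an assumption.

    import torch
    import torch.nn.functional as F

    def cross_modal_contrastive_loss(vision_emb, text_emb, temperature=0.07):
        # Project both modalities onto the unit sphere of a shared space.
        v = F.normalize(vision_emb, dim=-1)
        t = F.normalize(text_emb, dim=-1)
        logits = v @ t.T / temperature                       # (batch, batch) similarities
        targets = torch.arange(v.size(0), device=v.device)   # diagonal = positive pairs
        # Symmetric cross-entropy: vision-to-text and text-to-vision directions.
        return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))
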
Test-Time Adaptation for Visual Document Understanding
TLDR
This work proposes DocTTA, a novel test-time adaptation approach for documents that leverages cross-modality self-supervised learning via masked visual language modeling, as well as pseudo labeling, to adapt models learned on a source domain to an unlabeled target domain at test time.
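
A bare-bones sketch of the pseudo-labeling half of such an approach is shown below: the source-trained model labels target-domain documents with its own confident predictions and is fine-tuned on them at test time. The confidence threshold and the Hugging Face-style model output are assumptions, and the masked visual language modeling term is omitted for brevity.

    import torch.nn.functional as F

    def pseudo_label_adaptation_step(model, optimizer, target_batch, threshold=0.9):
        # Predict on an unlabeled target-domain batch with the source-trained model.
        logits = model(**target_batch).logits            # assumes a HF-style output object
        probs = F.softmax(logits, dim=-1)
        confidence, pseudo_labels = probs.max(dim=-1)
        keep = confidence > threshold                    # keep only confident predictions

        if keep.any():
            # Treat confident predictions as labels and take one adaptation step.
            loss = F.cross_entropy(logits[keep], pseudo_labels[keep])
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
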
Clean your desk! Transformers for unsupervised clustering of document images
TLDR
It is found that LayoutLMv2 generally outperforms LayoutLM, although LayoutLM may have advantages for text-heavy documents, and surprisingly, the [CLS] token output is not always the best document representation, at least in the context of clustering.
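
The comparison described here comes down to which pooled vector is fed to the clustering algorithm. A minimal sketch, assuming the Hugging Face LayoutLM checkpoint and scikit-learn's KMeans, of contrasting the [CLS] output with mean-pooled token states:

    import torch
    from transformers import LayoutLMModel
    from sklearn.cluster import KMeans

    model = LayoutLMModel.from_pretrained("microsoft/layoutlm-base-uncased")

    @torch.no_grad()
    def document_embedding(input_ids, bbox, attention_mask, use_cls=True):
        # Two candidate document representations: the [CLS] token output versus
        # the mean of all non-padding token hidden states.
        hidden = model(input_ids=input_ids, bbox=bbox,
                       attention_mask=attention_mask).last_hidden_state
        if use_cls:
            return hidden[:, 0]
        mask = attention_mask.unsqueeze(-1).float()
        return (hidden * mask).sum(1) / mask.sum(1)

    # After stacking embeddings for a corpus into a (num_docs, hidden_size) array:
    # cluster_ids = KMeans(n_clusters=10).fit_predict(embeddings)
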

References

Showing 1-10 of 59 references
LayoutLMv2: Multi-modal Pre-training for Visually-rich Document Understanding
TLDR
The LayoutLMv2 architecture is proposed with new pre-training tasks to model the interaction among text, layout, and image in a single multi-modal framework, and achieves new state-of-the-art results on a wide variety of downstream visually-rich document understanding tasks.
UNITER: UNiversal Image-TExt Representation Learning
TLDR
UNITER, a UNiversal Image-TExt Representation learned through large-scale pre-training over four image-text datasets, is introduced; it can power heterogeneous downstream V+L tasks with joint multimodal embeddings.
LayoutLM: Pre-training of Text and Layout for Document Image Understanding
TLDR
LayoutLM is proposed to jointly model interactions between text and layout information across scanned document images, which benefits a great number of real-world document image understanding tasks such as information extraction from scanned documents.
SelfDoc: Self-Supervised Document Representation Learning
TLDR
This work proposes SelfDoc, a task-agnostic pre-training framework for document image understanding that benefits from self-supervised pre-training on documents without requiring annotations via a feature masking training strategy, and introduces a novel modality-adaptive attention mechanism for multimodal feature fusion that adaptively emphasizes language and vision signals.
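
One simple way to picture modality-adaptive fusion (a generic gating sketch, not SelfDoc's actual attention mechanism) is a learned gate that decides, per position, how much weight to give the language versus the vision features:

    import torch
    import torch.nn as nn

    class AdaptiveModalityFusion(nn.Module):
        # Gated fusion: a sigmoid gate computed from both modalities decides how much
        # each position relies on the language features versus the vision features.
        def __init__(self, dim):
            super().__init__()
            self.gate = nn.Linear(2 * dim, 1)

        def forward(self, lang_feat, vis_feat):          # both: (batch, seq, dim)
            g = torch.sigmoid(self.gate(torch.cat([lang_feat, vis_feat], dim=-1)))
            return g * lang_feat + (1 - g) * vis_feat
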
BROS: A Pre-trained Language Model Focusing on Text and Layout for Better Key Information Extraction from Documents
TLDR
A pre-trained language model named BROS (BERT Relying On Spatiality) is proposed that encodes relative positions of texts in 2D space and learns from unlabeled documents with an area-masking strategy, showing comparable or better performance than previous methods on four KIE benchmarks without relying on visual features.
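
A simplified way to encode relative positions of text in 2D space is to turn pairwise offsets between box centers into an attention bias; the sketch below is a generic illustration under that assumption, not BROS's actual relative-position formulation.

    import torch
    import torch.nn as nn

    class Relative2DBias(nn.Module):
        # Maps pairwise (dx, dy) offsets between text-box centers to a scalar bias
        # that can be added to attention scores, making attention layout-aware.
        def __init__(self, hidden=32):
            super().__init__()
            self.mlp = nn.Sequential(nn.Linear(2, hidden), nn.ReLU(), nn.Linear(hidden, 1))

        def forward(self, centers):                      # (seq_len, 2), normalized to [0, 1]
            offsets = centers[:, None, :] - centers[None, :, :]   # (seq, seq, 2) relative offsets
            return self.mlp(offsets).squeeze(-1)                  # (seq, seq) attention bias
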
LAMPRET: Layout-Aware Multimodal PreTraining for Document Understanding
TLDR
A novel layout-aware multimodal hierarchical framework, LAMPreT, is proposed to model both individual blocks and the whole document; its evaluation shows the effectiveness of the proposed hierarchical architecture as well as the pretraining techniques.
LayoutXLM: Multimodal Pre-training for Multilingual Visually-rich Document Understanding
TLDR
The LayoutXLM model significantly outperforms existing SOTA cross-lingual pre-trained models on the XFUN dataset, aiming to bridge language barriers for visually-rich document understanding.
DiT: Self-supervised Pre-training for Document Image Transformer
TLDR
This paper proposes DiT, a self-supervised pre-trained Document Image Transformer model that uses large-scale unlabeled text images for Document AI tasks, which is essential since no supervised counterparts exist due to the lack of human-labeled document images.
Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning
TLDR
This paper proposes SOHO ("Seeing Out of tHe bOx"), which takes a whole image as input and learns vision-language representation in an end-to-end manner; it does not require bounding box annotations, which enables inference 10 times faster than region-based approaches.
UniDoc: Unified Pretraining Framework for Document Understanding
TLDR
UniDoc is a new unified pretraining framework for document understanding that learns a generic representation by making use of three self-supervised losses, encouraging the representation to model sentences, learn similarities, and align modalities.