SelfDoc: Self-Supervised Document Representation Learning

  • Peizhao Li, Jiuxiang Gu, Jason Kuen, Vlad I. Morariu, Handong Zhao, R. Jain, Varun Manjunatha, Hongfu Liu
  • Published 1 June 2021
  • Computer Science
  • 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
We propose SelfDoc, a task-agnostic pre-training framework for document image understanding. Because documents are multimodal and are intended for sequential reading, our framework exploits the positional, textual, and visual information of every semantically meaningful component in a document, and it models the contextualization between each block of content. Unlike existing document pre-training models, our model is coarse-grained instead of treating individual words as input, therefore… 
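The abstract describes fusing positional, textual, and visual information per semantic block rather than per word. A minimal sketch of that idea, with hypothetical names and toy dimensions (this is an illustration of block-level multimodal fusion, not SelfDoc's actual architecture):

```python
# Sketch: fuse per-block position, text, and visual features into one
# input vector, reflecting SelfDoc's coarse-grained (block-level) input.
# Feature names and dimensions are illustrative, not from the paper.

def fuse_block(pos_feat, text_feat, vis_feat):
    """Sum-fuse three equal-length feature vectors for one block."""
    assert len(pos_feat) == len(text_feat) == len(vis_feat)
    return [p + t + v for p, t, v in zip(pos_feat, text_feat, vis_feat)]

# One document becomes a sequence of fused block embeddings, which a
# Transformer-style encoder can then contextualize against each other.
blocks = [
    fuse_block([0.1, 0.2], [1.0, 0.0], [0.5, 0.5]),
    fuse_block([0.3, 0.1], [0.0, 1.0], [0.2, 0.8]),
]
```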

Unified Pretraining Framework for Document Understanding

UDoc, a new unified pretraining framework for document understanding, is presented; it extends the Transformer to take multimodal embeddings as input and is designed to support most document understanding tasks.

UniDoc: Unified Pretraining Framework for Document Understanding

UniDoc is a new unified pretraining framework for document understanding that learns a generic representation by making use of three self-supervised losses, encouraging the representation to model sentences, learn similarities, and align modalities.

LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking

LayoutLMv3 is proposed to pre-train multimodal Transformers for Document AI with unified text and image masking, and is pre-trained with a word-patch alignment objective to learn cross-modal alignment by predicting whether the corresponding image patch of a text word is masked.
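The word-patch alignment objective described above is a binary prediction per text token: is the image patch corresponding to that token masked? A hedged sketch of such a label scheme (the token-to-patch mapping and indices are hypothetical, for illustration only):

```python
# Sketch of a word-patch alignment (WPA) label scheme: for each text
# token, the label is 1 if its corresponding image patch is masked,
# else 0. A classifier head would be trained to predict these labels.

def wpa_labels(token_to_patch, masked_patches):
    """Return one binary alignment label per text token."""
    return [1 if p in masked_patches else 0 for p in token_to_patch]

labels = wpa_labels(token_to_patch=[0, 3, 7], masked_patches={3})
# labels == [0, 1, 0]
```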

VLCDoC: Vision-Language Contrastive Pre-Training Model for Cross-Modal Document Classification

The proposed learning objective combines intra- and inter-modality alignment tasks, where the per-task similarity distribution is computed by contracting positive sample pairs while simultaneously contrasting negative ones in a common feature representation space.
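Contrastive objectives of this kind are commonly instantiated as an InfoNCE-style loss: maximize the similarity of the positive pair relative to negatives under a softmax. A minimal sketch (temperature value and function name are illustrative assumptions, not taken from the paper):

```python
import math

# Sketch of an InfoNCE-style contrastive term: the loss is the negative
# log-softmax of the positive pair's similarity against the negatives,
# so it pulls positives together and pushes negatives apart.

def info_nce(pos_sim, neg_sims, temperature=0.1):
    logits = [pos_sim / temperature] + [s / temperature for s in neg_sims]
    m = max(logits)  # subtract the max for numerical stability
    log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
    return -(logits[0] - log_denom)
```

The loss shrinks toward zero as the positive similarity dominates the negatives, and grows when a negative pair looks more similar than the positive one.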

Test-Time Adaptation for Visual Document Understanding

DocTTA is proposed, a novel test-time adaptation approach for documents that leverages cross-modality self-supervised learning via masked visual language modeling as well as pseudo labeling to adapt models learned on a source domain to an unlabeled target domain at test time.

Bi-VLDoc: Bidirectional Vision-Language Modeling for Visually-Rich Document Understanding

A pre-training paradigm called Bi-VLDoc is proposed, in which a bidirectional vision-language supervision strategy and a vision-language hybrid-attention mechanism are devised to fully explore and utilize the interactions between these two modalities, to learn stronger cross-modal document representations with richer semantics.

Knowing Where and What: Unified Word Block Pretraining for Document Understanding

This paper focuses on embedding learning for word blocks containing text and layout information, and proposes UTel, a language model with Unified TExt and Layout pre-training that outperforms previous methods on various downstream tasks while requiring no image modality.

BROS: A Pre-trained Language Model Focusing on Text and Layout for Better Key Information Extraction from Documents

A pre-trained language model, named BROS (BERT Relying On Spatiality), is proposed that encodes the relative positions of texts in 2D space and learns from unlabeled documents with an area-masking strategy, showing comparable or better performance than previous methods on four KIE benchmarks without relying on visual features.

Multimodal Pre-training Based on Graph Attention Network for Document Understanding

GraphDoc is a multimodal graph attention-based model for various document understanding tasks that learns a generic representation from only 320k unlabeled documents via the Masked Sentence Modeling task.

ERNIE-mmLayout: Multi-grained MultiModal Transformer for Document Understanding

Experimental results on four tasks, including information extraction and document question answering, show that the proposed method can improve the performance of multimodal Transformers based on fine-grained elements and achieve better performance with fewer parameters.

Self-Supervised Representation Learning on Document Images

A novel method for self-supervision is proposed that makes use of the inherent multi-modality of documents (image and text) and performs better than other popular self-supervised methods, including supervised ImageNet pre-training, on document image classification scenarios with a limited amount of data.

Towards a Multi-modal, Multi-task Learning based Pre-training Framework for Document Representation Learning

In this paper, we propose a multi-task learning-based framework that utilizes a combination of self-supervised and supervised pre-training tasks to learn a generic document representation. We design…

LayoutLM: Pre-training of Text and Layout for Document Image Understanding

The LayoutLM is proposed to jointly model interactions between text and layout information across scanned document images, which is beneficial for a great number of real-world document image understanding tasks such as information extraction from scanned documents.

Learning to Extract Semantic Structure from Documents Using Multimodal Fully Convolutional Neural Networks

An end-to-end, multimodal, fully convolutional network for extracting semantic structures from document images using a unified model that classifies pixels based not only on their visual appearance, as in the traditional page segmentation task, but also on the content of underlying text.

Improving Language Understanding by Generative Pre-Training

The general task-agnostic model outperforms discriminatively trained models that use architectures specifically crafted for each task, improving upon the state of the art in 9 out of the 12 tasks studied.

ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks

We present ViLBERT (short for Vision-and-Language BERT), a model for learning task-agnostic joint representations of image content and natural language. We extend the popular BERT architecture to a…

LXMERT: Learning Cross-Modality Encoder Representations from Transformers

The LXMERT (Learning Cross-Modality Encoder Representations from Transformers) framework, a large-scale Transformer model that consists of three encoders, achieves the state-of-the-art results on two visual question answering datasets and shows the generalizability of the pre-trained cross-modality model.

Evaluation of deep convolutional nets for document image classification and retrieval

A new state-of-the-art for document image classification and retrieval, using features learned by deep convolutional neural networks (CNNs), and makes available a new labelled subset of the IIT-CDIP collection, containing 400,000 document images across 16 categories.

Graph Convolution for Multimodal Information Extraction from Visually Rich Documents

This paper introduces a graph convolution based model to combine textual and visual information presented in VRDs, outperforming BiLSTM-CRF baselines by significant margins on two real-world datasets.

Longformer: The Long-Document Transformer

Following prior work on long-sequence transformers, Longformer is evaluated on character-level language modeling, achieving state-of-the-art results on text8 and enwik8; the authors also pretrain Longformer and finetune it on a variety of downstream tasks.
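Longformer's core mechanism is a sliding-window attention pattern: each position attends only to neighbors within a fixed window, giving cost linear rather than quadratic in sequence length. A minimal sketch of such an attention mask (window size and function name are illustrative):

```python
# Sketch of a Longformer-style sliding-window attention mask:
# mask[i][j] is True when position i may attend to position j,
# i.e. when j lies within a fixed window of i.

def window_mask(seq_len, window):
    """Boolean attention mask for a symmetric sliding window."""
    return [[abs(i - j) <= window for j in range(seq_len)]
            for i in range(seq_len)]

mask = window_mask(5, 1)
# row 0 attends to positions {0, 1}; row 2 attends to {1, 2, 3}
```

Each row has at most 2·window + 1 True entries, so the total attention cost grows linearly with sequence length. (The full Longformer also adds global attention on selected tokens, which this sketch omits.)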