LAMBERT: Layout-Aware Language Modeling for Information Extraction

@inproceedings{Garncarek2021LAMBERTLL,
  title={LAMBERT: Layout-Aware Language Modeling for Information Extraction},
  author={Lukasz Garncarek and Rafal Powalski and Tomasz Stanisławek and Bartosz Topolski and Piotr Halama and Michał P. Turski and Filip Graliński},
  booktitle={ICDAR},
  year={2021}
}
We introduce a simple new approach to the problem of understanding documents where non-trivial layout influences the local semantics. To this end, we modify the Transformer encoder architecture in a way that allows it to use layout features obtained from an OCR system, without the need to re-learn language semantics from scratch. We only augment the input of the model with the coordinates of token bounding boxes, avoiding, in this way, the use of raw images. This leads to a layout-aware… 
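The abstract above is truncated, so the following is only an illustrative sketch of the general idea it describes: adding a projection of each token's bounding-box coordinates to that token's input embedding, with no raw image involved. The function names, the plain linear projection, and the random weights are assumptions made for illustration, not the authors' implementation.

```python
import random


def normalize_box(box, page_width, page_height):
    """Scale an (x0, y0, x1, y1) box from page units to the [0, 1] range."""
    x0, y0, x1, y1 = box
    return (x0 / page_width, y0 / page_height, x1 / page_width, y1 / page_height)


def make_projection(hidden_size, seed=0):
    """Hypothetical 4 x hidden_size projection matrix.

    Random here; in a real model these weights would be learned.
    """
    rng = random.Random(seed)
    return [[rng.uniform(-0.02, 0.02) for _ in range(hidden_size)] for _ in range(4)]


def add_layout(token_embeddings, boxes, page_size, projection):
    """Add a projected bounding-box vector to each token embedding."""
    width, height = page_size
    augmented = []
    for embedding, box in zip(token_embeddings, boxes):
        coords = normalize_box(box, width, height)
        # Linear projection of the 4 coordinates into the embedding space.
        layout = [
            sum(c * projection[i][j] for i, c in enumerate(coords))
            for j in range(len(embedding))
        ]
        augmented.append([e + l for e, l in zip(embedding, layout)])
    return augmented


# Two toy tokens with OCR boxes on a 1000 x 1000 page (made-up values).
hidden = 8
tokens = [[0.0] * hidden, [1.0] * hidden]
boxes = [(10, 20, 110, 40), (120, 20, 200, 40)]
proj = make_projection(hidden)
augmented = add_layout(tokens, boxes, page_size=(1000, 1000), projection=proj)
```

Because the layout term is added to the existing token embeddings rather than replacing them, a pretrained text encoder's input representation is kept as the starting point, which is in the spirit of the abstract's remark about not re-learning language semantics from scratch.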

ViBERTgrid: A Jointly Trained Multi-Modal 2D Document Representation for Key Information Extraction from Documents

This paper proposes a new multi-modal backbone network by concatenating a BERTgrid to an intermediate layer of a CNN model, to generate a more powerful grid-based document representation, named ViBERTgrid, which has achieved state-of-the-art performance on real-world datasets.

LayoutXLM vs. GNN: An Empirical Evaluation of Relation Extraction for Documents

This paper investigates the Relation Extraction task in documents by benchmarking two different neural network models: a multimodal language model (LayoutXLM) and a Graph Neural Network: Edge

ERNIE-mmLayout: Multi-grained MultiModal Transformer for Document Understanding

Experimental results on four tasks, including information extraction and document question answering, show that the proposed method can improve the performance of multimodal Transformers based on fine-grained elements and achieve better performance with fewer parameters.

TRIE++: Towards End-to-End Information Extraction from Visually Rich Documents

An end-to-end information extraction framework for visually rich documents is proposed, in which text reading and information extraction reinforce each other via a well-designed multi-modal context block.

Clean your desk! Transformers for unsupervised clustering of document images

It is found that LayoutLMv2 generally outperforms LayoutLM, although LayoutLM may have advantages for text-heavy documents, and surprisingly, the [CLS] token output is not always the best document representation, at least in the context of clustering.

Business Document Information Extraction: Towards Practical Benchmarks

There is a lack of relevant datasets and benchmarks for Document IE on semi-structured business documents as their content is typically legally protected or sensitive, and potential sources of available documents including synthetic data are discussed.

STable: Table Generation Framework for Encoder-Decoder Models

A framework for text-to-table neural models applicable to problems such as extraction of line items, joint entity and relation extraction, or knowledge base population is proposed, which establishes state-of-the-art results on several challenging datasets.

LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking

LayoutLMv3 is proposed to pre-train multimodal Transformers for Document AI with unified text and image masking, and is pre-trained with a word-patch alignment objective to learn cross-modal alignment by predicting whether the corresponding image patch of a text word is masked.

Towards Few-shot Entity Recognition in Document Images: A Label-aware Sequence-to-Sequence Framework

This paper builds an entity recognition model requiring only a few shots of annotated document images and develops a novel label-aware seq2seq framework, LASER, which refines the label semantics by updating the label surface name representations and also strengthens the label-region correlation.

FormNet: Structural Encoding beyond Sequential Modeling in Form Document Information Extraction

This work proposes FormNet, a structure-aware sequence model to mitigate the suboptimal serialization of forms, which designs Rich Attention that leverages the spatial relationship between tokens in a form for more precise attention score calculation and constructs Super-Tokens for each word.

References

LayoutLMv2: Multi-modal Pre-training for Visually-rich Document Understanding

The LayoutLMv2 architecture is proposed, with new pre-training tasks that model the interaction among text, layout, and image in a single multi-modal framework; it achieves new state-of-the-art results on a wide variety of downstream visually-rich document understanding tasks.

LayoutLM: Pre-training of Text and Layout for Document Image Understanding

LayoutLM is proposed to jointly model interactions between text and layout information across scanned document images, which is beneficial for a great number of real-world document image understanding tasks such as information extraction from scanned documents.

Leaderboard of the Information Extraction Task, Robust Reading Competition

  • https://rrc.cvc.uab.es/?ch=13&com=evaluation&task=3 (accessed April 7, 2020)
  • 2020

Integrating Multimodal Information in Large Pretrained Transformers

Fine-tuning MAG-BERT and MAG-XLNet significantly boosts sentiment analysis performance over previous baselines as well as over language-only fine-tuning of BERT and XLNet.

TRIE: End-to-End Text Reading and Information Extraction for Document Understanding

This paper proposes a unified end-to-end text reading and information extraction network in which the two tasks reinforce each other: the multimodal visual and textual features of text reading are fused for information extraction, and in turn the semantics in information extraction contribute to the optimization of text reading.

Don’t Stop Pretraining: Adapt Language Models to Domains and Tasks

It is consistently found that multi-phase adaptive pretraining offers large gains in task performance, and it is shown that adapting to a task corpus augmented using simple data selection strategies is an effective alternative, especially when resources for domain-adaptive pretraining might be unavailable.

PICK: Processing Key Information Extraction from Documents using Improved Graph Learning-Convolutional Networks

PICK is introduced, a framework that is effective and robust in handling complex document layouts for KIE by combining graph learning with graph convolution operation, yielding a richer semantic representation containing the textual and visual features and global layout without ambiguity.

Kleister: A novel task for Information Extraction involving Long Documents with Complex Layout

A new task (named Kleister) is introduced with two new datasets to encourage progress on deeper and more complex Information Extraction (IE), and a Pipeline method is proposed as a text-only baseline with different Named Entity Recognition architectures (Flair, BERT, RoBERTa).

Transformers: State-of-the-Art Natural Language Processing

Transformers is an open-source library that consists of carefully engineered state-of-the-art Transformer architectures under a unified API and a curated collection of pretrained models made by and available for the community.

CORD: A Consolidated Receipt Dataset for Post-OCR Parsing

A consolidated dataset for receipt parsing is published, consisting of thousands of Indonesian receipts with images, box/text annotations for OCR, and multi-level semantic labels for parsing.