TAP: Text-Aware Pre-training for Text-VQA and Text-Caption

@article{Yang2020TAPTP,
  title={TAP: Text-Aware Pre-training for Text-VQA and Text-Caption},
  author={Zhengyuan Yang and Yijuan Lu and Jianfeng Wang and Xi Yin and Dinei A. F. Flor{\^e}ncio and Lijuan Wang and Cha Zhang and Lei Zhang and Jiebo Luo},
  journal={2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2021},
  pages={8747-8757}
}
  • Zhengyuan Yang, Yijuan Lu, Jianfeng Wang, Xi Yin, Dinei A. F. Florêncio, Lijuan Wang, Cha Zhang, Lei Zhang, Jiebo Luo
  • Published 8 December 2020
  • Computer Science
  • 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
In this paper, we propose Text-Aware Pre-training (TAP) for Text-VQA and Text-Caption tasks. These two tasks aim at reading and understanding scene text in images for question answering and image caption generation, respectively. In contrast to conventional vision-language pre-training that fails to capture scene text and its relationship with the visual and text modalities, TAP explicitly incorporates scene text (generated from OCR engines) during pre-training. With three pre-training tasks… 
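
To make the described fusion concrete, below is a minimal sketch (not the authors' code) of how question tokens, OCR-token features, and detected-region features could be projected into a shared space and fed to a single transformer encoder, so that self-attention can relate scene text to both the language and visual inputs. All module names, feature dimensions, and layer counts are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TextAwareFusion(nn.Module):
    """Toy fusion module: question tokens + OCR tokens + region features in one sequence."""
    def __init__(self, vocab_size=30522, ocr_dim=300, region_dim=2048, d_model=768):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, d_model)   # question / caption tokens
        self.ocr_proj = nn.Linear(ocr_dim, d_model)          # OCR-token features (e.g. FastText)
        self.obj_proj = nn.Linear(region_dim, d_model)       # detected-region features
        layer = nn.TransformerEncoderLayer(d_model, nhead=12, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)

    def forward(self, question_ids, ocr_feats, region_feats):
        # One joint sequence lets self-attention relate scene text to both the
        # question words and the visual objects.
        seq = torch.cat([self.word_emb(question_ids),
                         self.ocr_proj(ocr_feats),
                         self.obj_proj(region_feats)], dim=1)
        return self.encoder(seq)

fusion = TextAwareFusion()
out = fusion(torch.randint(0, 30522, (2, 20)),  # 20 question tokens
             torch.randn(2, 50, 300),           # 50 OCR tokens
             torch.randn(2, 36, 2048))          # 36 regions
print(out.shape)  # torch.Size([2, 106, 768])
```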

Winner Team Mia at TextVQA Challenge 2021: Vision-and-Language Representation Learning with Pre-trained Sequence-to-Sequence Model

This winning entry uses the generative model T5 for the TextVQA task and proposes performing adversarial training in the embedding space of each modality, rather than adding adversarial perturbations to image pixels and textual tokens.
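
As a rough illustration of embedding-space adversarial training (an FGM-style single-step variant, not necessarily the team's exact procedure), the sketch below perturbs the multimodal embeddings along the loss gradient instead of the raw pixels or tokens. `model.embed`, `model.forward_from_embeddings`, and `loss_fn` are hypothetical placeholders.

```python
import torch

def adversarial_step(model, inputs, labels, loss_fn, epsilon=1e-2):
    """One FGM-style step: perturb modality embeddings, not pixels or token ids."""
    emb = model.embed(inputs)                     # hypothetical: returns modality embeddings
    emb.retain_grad()
    clean_loss = loss_fn(model.forward_from_embeddings(emb), labels)
    clean_loss.backward(retain_graph=True)        # populates emb.grad and parameter grads

    grad = emb.grad.detach()
    delta = epsilon * grad / (grad.norm(dim=-1, keepdim=True) + 1e-8)
    adv_loss = loss_fn(model.forward_from_embeddings(emb + delta), labels)
    adv_loss.backward()                           # accumulate adversarial gradients
    return clean_loss.item(), adv_loss.item()
```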

Localize, Group, and Select: Boosting Text-VQA by Scene Text Modeling

Experiments show that LOGOS outperforms previous state-of-the-art methods on two Text-VQA benchmarks without using additional OCR annotation data, and demonstrate the capability of LOGOS to bridge different modalities and better understand scene text.

Vision-Language Pre-Training for Boosting Scene Text Detectors

This paper proposes to learn contextualized, joint representations through vision-language pre-training to enhance the performance of scene text detectors, and devises a pre-training architecture with an image encoder, a text encoder, and a cross-modal encoder.

PreSTU: Pre-Training for Scene-Text Understanding

PreSTU introduces OCR-aware pre-training objectives that encourage the model to recognize text from an image and to connect what is recognized to the rest of the image content.
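
A minimal sketch of such an OCR-aware objective, assuming a generic vision-encoder/text-decoder model: the decoder is simply trained to generate the OCR transcript of the image. The `model` and `tokenizer` below are hypothetical stand-ins, not PreSTU's actual components.

```python
def ocr_pretrain_step(model, tokenizer, image_features, ocr_strings, optimizer):
    """Train the decoder to generate the OCR transcript of each image (teacher forcing)."""
    # Target = scene text read by an OCR engine, e.g. sorted top-to-bottom, left-to-right.
    targets = tokenizer(ocr_strings, return_tensors="pt", padding=True).input_ids
    loss = model(image_features, labels=targets).loss   # hypothetical seq2seq interface
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```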

TAG: Boosting Text-VQA via Text-aware Visual Question-answer Generation

Text-VQA aims at answering questions that require understanding the textual cues in an image. Despite the great progress of existing Text-VQA methods, their performance suffers from insufficient human-labeled question-answer pairs.

Question-controlled Text-aware Image Captioning

A novel Geometry and Question Aware Model (GQAM), which achieves better captioning performance and question answering ability than carefully designed baselines on both datasets, and generates a personalized text-aware caption with a Multimodal Decoder.

Structured Multimodal Attentions for TextVQA

  • Chenyu Gao, Qi Zhu, …, Qi Wu
  • Computer Science
    IEEE Transactions on Pattern Analysis and Machine Intelligence
  • 2022
This paper proposes an end-to-end structured multimodal attention (SMA) neural network that outperforms the SoTA models on the TextVQA dataset and two tasks of the ST-VQA dataset among all models except the pre-training-based TAP.

Towards Models that Can See and Read

It is shown that scene-text understanding capabilities can boost vision-language models’ performance on VQA and CAP by up to 3…

LaTr: Layout-Aware Transformer for Scene-Text VQA

A novel multimodal architecture for Scene Text Visual Question Answering (STVQA), named Layout-Aware Transformer (LaTr), which performs vocabulary-free decoding and generalizes well beyond the training vocabulary, and improves robustness towards OCR errors.
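
The layout-aware idea can be illustrated with a small sketch in which each OCR token embedding is summed with embeddings of its quantized bounding-box coordinates, so the transformer sees where the text sits in the scene. Bucket count, vocabulary size, and dimensions below are assumptions rather than LaTr's actual configuration.

```python
import torch
import torch.nn as nn

class LayoutAwareEmbedding(nn.Module):
    """Token embedding + embeddings of the token's quantized bounding-box coordinates."""
    def __init__(self, vocab_size=32128, d_model=512, num_buckets=1000):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)
        self.x0 = nn.Embedding(num_buckets, d_model)
        self.y0 = nn.Embedding(num_buckets, d_model)
        self.x1 = nn.Embedding(num_buckets, d_model)
        self.y1 = nn.Embedding(num_buckets, d_model)

    def forward(self, token_ids, boxes):
        # boxes: (batch, seq, 4) = (x_min, y_min, x_max, y_max), normalized to [0, 1).
        b = (boxes * (self.x0.num_embeddings - 1)).long()
        return (self.tok(token_ids)
                + self.x0(b[..., 0]) + self.y0(b[..., 1])
                + self.x1(b[..., 2]) + self.y1(b[..., 3]))
```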

Towards Multimodal Vision-Language Models Generating Non-Generic Text

This work contends that vision-language models can benefit from information that can be extracted from an image but is not used by current models, and modifies previous multimodal frameworks to accept relevant information from any number of auxiliary classifiers.
...

References

SHOWING 1-10 OF 65 REFERENCES

UNITER: UNiversal Image-TExt Representation Learning

UNITER, a UNiversal Image-TExt Representation learned through large-scale pre-training over four image-text datasets, is introduced; it can power heterogeneous downstream V+L tasks with joint multimodal embeddings.

Structured Multimodal Attentions for TextVQA

  • Chenyu Gao, Qi Zhu, …, Qi Wu
  • Computer Science
    IEEE Transactions on Pattern Analysis and Machine Intelligence
  • 2022
This paper proposes an end-to-end structured multimodal attention (SMA) neural network that outperforms the SoTA models on the TextVQA dataset and two tasks of the ST-VQA dataset among all models except the pre-training-based TAP.

Multimodal Attention with Image Text Spatial Relationship for OCR-Based Image Captioning

A novel design, the Multimodal Attention Captioner with OCR Spatial Relationship (dubbed MMA-SR), is presented; it manages information from different modalities with a multimodal attention network and explores spatial relationships between text tokens for OCR-based image captioning.

Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks

This paper proposes a new learning method Oscar (Object-Semantics Aligned Pre-training), which uses object tags detected in images as anchor points to significantly ease the learning of alignments.
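
A tiny sketch of the anchor-point idea (an assumed input construction, not Oscar's released code): the detected object tags are appended to the text stream, so the same words appear on both sides of the cross-modal encoder and ease alignment. The tokenizer here is a hypothetical stand-in for any subword tokenizer.

```python
def build_oscar_style_input(tokenizer, caption, object_tags, region_features):
    """Caption words + detected object tags share one text stream; tags act as anchors."""
    # object_tags: detector class names, e.g. ["dog", "frisbee", "grass"]
    text = caption + " " + tokenizer.sep_token + " " + " ".join(object_tags)
    token_ids = tokenizer(text, return_tensors="pt").input_ids
    return token_ids, region_features   # both streams go to the cross-modal encoder
```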

Finding the Evidence: Localization-aware Answer Prediction for Text Visual Question Answering

This paper proposes a localization-aware answer prediction network (LaAP-Net) that not only generates the answer to the question but also predicts a bounding box as evidence of the generated answer.
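
As an illustration of predicting evidence alongside the answer, a minimal head might output answer logits plus a normalized bounding box. This is a sketch under assumed dimensions, not LaAP-Net's architecture.

```python
import torch.nn as nn

class AnswerWithEvidenceHead(nn.Module):
    """Predict answer logits plus a normalized box pointing at the supporting evidence."""
    def __init__(self, d_model=768, vocab_size=5000):
        super().__init__()
        self.answer = nn.Linear(d_model, vocab_size)
        self.box = nn.Linear(d_model, 4)   # (x_min, y_min, x_max, y_max) in [0, 1]

    def forward(self, pooled):
        return self.answer(pooled), self.box(pooled).sigmoid()
```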

Unified Vision-Language Pre-Training for Image Captioning and VQA

VLP is the first reported model that achieves state-of-the-art results on both vision-language generation and understanding tasks, as disparate as image captioning and visual question answering, across three challenging benchmark datasets: COCO Captions, Flickr30k Captions and VQA 2.0.

Confidence-aware Non-repetitive Multimodal Transformers for TextCaps

A Confidence-aware Non-repetitive Multimodal Transformer (CNMT) that addresses the issue of word redundancy in captions and outperforms state-of-the-art models on the TextCaps dataset, improving CIDEr from 81.0 to 93.0.
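
The non-repetition part can be illustrated with a small masking step (a sketch only; tensor shapes and names are assumptions): OCR tokens already copied into the caption are excluded from the copy scores at later decoding steps.

```python
import torch

def mask_repeated_ocr(copy_scores, used_ocr_ids):
    """Exclude OCR tokens that were already copied into the caption."""
    # copy_scores: (batch, num_ocr_tokens) scores for the current decoding step
    # used_ocr_ids: per-example lists of OCR indices already emitted
    masked = copy_scores.clone()
    for b, used in enumerate(used_ocr_ids):
        if used:
            masked[b, torch.tensor(used)] = float("-inf")
    return masked
```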

TextCaps: a Dataset for Image Captioning with Reading Comprehension

A novel dataset, TextCaps, with 145k captions for 28k images, challenges a model to recognize text, relate it to its visual context, and decide what part of the text to copy or paraphrase, requiring spatial, semantic, and visual reasoning between multiple text tokens and visual entities, such as objects.

Iterative Answer Prediction With Pointer-Augmented Multimodal Transformers for TextVQA

A novel model is proposed based on a multimodal transformer architecture accompanied by a rich representation for text in images that enables iterative answer decoding with a dynamic pointer network, allowing the model to form an answer through multi-step prediction instead of one-step classification.
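
To make the pointer mechanism concrete, here is a minimal sketch of a decoding head that scores both a fixed answer vocabulary and the image's OCR tokens at each step, with the argmax deciding whether to emit a vocabulary word or copy a scene-text token. Dimensions and module names are illustrative, not the paper's exact design.

```python
import torch
import torch.nn as nn

class PointerHead(nn.Module):
    """Score a fixed vocabulary and the image's OCR tokens jointly at each decoding step."""
    def __init__(self, d_model=768, vocab_size=5000):
        super().__init__()
        self.vocab_scores = nn.Linear(d_model, vocab_size)
        self.ocr_query = nn.Linear(d_model, d_model)

    def forward(self, dec_state, ocr_states):
        # dec_state: (batch, d_model) decoder output at the current step
        # ocr_states: (batch, num_ocr, d_model) encoded OCR tokens
        vocab = self.vocab_scores(dec_state)                       # (B, V)
        copy = torch.bmm(ocr_states,
                         self.ocr_query(dec_state).unsqueeze(-1)).squeeze(-1)  # (B, N)
        # Argmax over the concatenation decides: emit a vocab word or copy an OCR token.
        return torch.cat([vocab, copy], dim=-1)
```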
...