Spatially Aware Multimodal Transformers for TextVQA

@article{Kant2020SpatiallyAM,
  title={Spatially Aware Multimodal Transformers for TextVQA},
  author={Yash Kant and Dhruv Batra and Peter Anderson and Alexander G. Schwing and Devi Parikh and Jiasen Lu and Harsh Agrawal},
  journal={ArXiv},
  year={2020},
  volume={abs/2007.12146}
}
Textual cues are essential for everyday tasks like buying groceries and using public transport. To develop this assistive technology, we study the TextVQA task, i.e., reasoning about text in images to answer a question. Existing approaches are limited in their use of spatial relations and rely on fully-connected transformer-like architectures to implicitly learn the spatial structure of a scene. In contrast, we propose a novel spatially aware self-attention layer such that each visual entity… 
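
The core idea is to replace fully connected self-attention over visual entities with attention conditioned on the discrete spatial relation (e.g. left-of, above, overlapping) between each pair of objects and OCR tokens. The PyTorch sketch below is a hypothetical illustration of one such layer, not the authors' implementation: it assumes a precomputed integer relation matrix and restricts each attention head to a subset of relations via masking; the relation-to-head assignment and all names are illustrative.

```python
# Hypothetical sketch of a spatially aware self-attention layer.
# Assumes `rel[i, j]` is an integer in {0..R-1} giving the discrete spatial
# relation (e.g. left-of, above, overlaps) from entity i to entity j.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatiallyAwareSelfAttention(nn.Module):
    def __init__(self, dim=768, num_heads=12, num_relations=12):
        super().__init__()
        assert dim % num_heads == 0
        self.h, self.d = num_heads, dim // num_heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.out = nn.Linear(dim, dim)
        # Each head may only look at a subset of spatial relations; here we
        # simply assign relation r to head r % num_heads (purely illustrative).
        allowed = torch.zeros(num_heads, num_relations, dtype=torch.bool)
        for r in range(num_relations):
            allowed[r % num_heads, r] = True
        allowed[:, 0] = True  # let every head see the default/self relation
        self.register_buffer("allowed", allowed)

    def forward(self, x, rel):
        # x:   (B, N, dim) visual-entity features (objects + OCR tokens)
        # rel: (B, N, N)   integer spatial-relation ids
        B, N, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = q.view(B, N, self.h, self.d).transpose(1, 2)   # (B, h, N, d)
        k = k.view(B, N, self.h, self.d).transpose(1, 2)
        v = v.view(B, N, self.h, self.d).transpose(1, 2)
        scores = q @ k.transpose(-2, -1) / self.d ** 0.5    # (B, h, N, N)
        # Head h attends to pair (i, j) only if rel[i, j] is in its allowed set.
        mask = self.allowed[:, rel].permute(1, 0, 2, 3)     # (B, h, N, N)
        scores = scores.masked_fill(~mask, float("-inf"))
        attn = torch.nan_to_num(F.softmax(scores, dim=-1))  # fully masked rows -> 0
        ctx = (attn @ v).transpose(1, 2).reshape(B, N, self.h * self.d)
        return self.out(ctx)
```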

Toward 3D Spatial Reasoning for Human-like Text-based Visual Question Answering

Text-based Visual Question Answering (TextVQA) aims to produce correct answers to questions about images that contain multiple scene texts. In most cases, the texts naturally attach to the…

Beyond OCR + VQA: Involving OCR into the Flow for Robust and Accurate TextVQA

TLDR
A visually enhanced text embedding is proposed to enable understanding of texts without accurately recognizing them, and rich contextual information is further leveraged to modify the answer texts even if the OCR module does not correctly recognize them.

TAG: Boosting Text-VQA via Text-aware Visual Question-answer Generation

Text-VQA aims at answering questions that require understanding the textual cues in an image. Despite the great progress of existing Text-VQA methods, their performance suffers from insufficient…

Graph Relation Transformer: Incorporating pairwise object features into the Transformer architecture

TLDR
A Graph Relation Transformer (GRT), which uses edge information in addition to node information for graph attention computation in the Transformer, is proposed, and it is observed that the GRT has superior spatial reasoning ability to M4C.
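
As a rough illustration of folding pairwise (edge) features into Transformer attention, the hypothetical sketch below projects an edge-feature tensor (e.g. box offsets or IoU between two entities) to a per-head bias added to the query-key logits; the GRT paper's exact formulation may differ.

```python
# Hypothetical sketch: attention logits augmented with pairwise edge features.
import torch
import torch.nn as nn
import torch.nn.functional as F

class EdgeAwareAttention(nn.Module):
    def __init__(self, dim=768, edge_dim=5, num_heads=12):
        super().__init__()
        self.h, self.d = num_heads, dim // num_heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.edge_proj = nn.Linear(edge_dim, num_heads)  # one additive bias per head
        self.out = nn.Linear(dim, dim)

    def forward(self, x, edge):
        # x:    (B, N, dim)          node features (objects + OCR tokens)
        # edge: (B, N, N, edge_dim)  pairwise features, e.g. box offsets / IoU
        B, N, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = q.view(B, N, self.h, self.d).transpose(1, 2)
        k = k.view(B, N, self.h, self.d).transpose(1, 2)
        v = v.view(B, N, self.h, self.d).transpose(1, 2)
        logits = q @ k.transpose(-2, -1) / self.d ** 0.5            # (B, h, N, N)
        logits = logits + self.edge_proj(edge).permute(0, 3, 1, 2)  # add edge bias
        attn = F.softmax(logits, dim=-1)
        ctx = (attn @ v).transpose(1, 2).reshape(B, N, self.h * self.d)
        return self.out(ctx)
```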

Towards Reasoning Ability in Scene Text Visual Question Answering

TLDR
This work designs a gradient-based explainability method to explore why TextVQA models answer what they answer and to find evidence for their predictions, and performs qualitative experiments to visually analyze the models' reasoning ability.

Position-Augmented Transformers with Entity-Aligned Mesh for TextVQA

TLDR
This work proposes a novel model, position-augmented transformers with entity-aligned mesh, for the TextVQA task, and explicitly introduces continuous relative position information of objects and OCR tokens without complex rules.

External Knowledge enabled Text Visual Question Answering

TLDR
This work designs a framework to extract, validate, and reason with knowledge using a standard multimodal transformer for vision language understanding tasks, and demonstrates how external knowledge can highlight instance-only cues and thus help deal with training data bias, improve answer entity type correctness, and detect multiword named entities.

External Knowledge Augmented Text Visual Question Answering

TLDR
This work designs a framework to extract, validate, and reason with knowledge using a standard multimodal transformer for vision language understanding tasks, and demonstrates how external knowledge can highlight instance-only cues and thus help deal with training data bias, improve answer entity type correctness, and detect multiword named entities.

Structured Multimodal Attentions for TextVQA

  • Chenyu Gao, Qi Zhu, Qi Wu
  • IEEE Transactions on Pattern Analysis and Machine Intelligence
  • 2021
TLDR
An end-to-end structured multimodal attention (SMA) neural network is proposed to mainly solve the first two issues above.

EKTVQA: Generalized Use of External Knowledge to Empower Scene Text in Text-VQA

TLDR
This work designs a framework to extract, validate, and reason with knowledge using a standard multimodal transformer for vision language understanding tasks, and demonstrates how external knowledge can highlight instance-only cues and thus help deal with training data bias, improve answer entity type correctness, and detect multiword named entities.

References

SHOWING 1-10 OF 54 REFERENCES

Scene Text Visual Question Answering

TLDR
A new dataset, ST-VQA, is presented that aims to highlight the importance of exploiting the high-level semantic information present in images as textual cues in the Visual Question Answering process, and a new evaluation metric is proposed for these tasks that accounts both for reasoning errors and for shortcomings of the text recognition module.
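
To the best of my knowledge, the metric proposed by ST-VQA is the Average Normalized Levenshtein Similarity (ANLS), which gives partial credit when a predicted string is close to a ground-truth answer and zero credit beyond a distance threshold. A minimal sketch, with the usual threshold of 0.5 and helper names of my own choosing:

```python
# Minimal sketch of ANLS (Average Normalized Levenshtein Similarity),
# the soft string-matching metric used for scene-text VQA.
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                   # deletion
                           cur[j - 1] + 1,                # insertion
                           prev[j - 1] + (ca != cb)))     # substitution
        prev = cur
    return prev[-1]

def anls(predictions, ground_truths, tau=0.5):
    # predictions:   list of predicted answer strings, one per question
    # ground_truths: list of lists of acceptable answers per question
    total = 0.0
    for pred, answers in zip(predictions, ground_truths):
        best = 0.0
        for gt in answers:
            p, g = pred.strip().lower(), gt.strip().lower()
            nl = levenshtein(p, g) / max(len(p), len(g), 1)
            best = max(best, 1.0 - nl if nl < tau else 0.0)
        total += best
    return total / max(len(predictions), 1)

# Example: exact match to one of the acceptable answers scores 1.0,
# a near-miss OCR reading would still earn partial credit.
print(anls(["hollywood boulevard"], [["hollywood blvd", "hollywood boulevard"]]))
```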

OCR-VQA: Visual Question Answering by Reading Text in Images

TLDR
This paper introduces a novel task of visual question answering by reading text in images, i.e., by optical character recognition (OCR), and presents a large-scale dataset, OCR-VQA-200K, which comprises 207,572 images of book covers and more than 1 million question-answer pairs about these images.

Iterative Answer Prediction With Pointer-Augmented Multimodal Transformers for TextVQA

TLDR
A novel model is proposed based on a multimodal transformer architecture accompanied by a rich representation for text in images that enables iterative answer decoding with a dynamic pointer network, allowing the model to form an answer through multi-step prediction instead of one-step classification.
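
This decoder (the M4C model) scores, at every step, both a fixed answer vocabulary and the OCR tokens detected in the image, and feeds the selected token back in for the next step. A highly simplified, hypothetical sketch of the per-step scoring, with illustrative names:

```python
# Hypothetical sketch of one step of pointer-augmented answer decoding:
# the decoder state is scored against a fixed vocabulary and, via a
# dynamic pointer network, against the OCR-token features of the image.
import torch
import torch.nn as nn

class PointerAugmentedDecoder(nn.Module):
    def __init__(self, dim=768, vocab_size=5000):
        super().__init__()
        self.vocab_head = nn.Linear(dim, vocab_size)   # scores for fixed vocabulary
        self.query_proj = nn.Linear(dim, dim)          # dynamic pointer query
        self.ocr_proj = nn.Linear(dim, dim)            # dynamic pointer keys

    def forward(self, dec_state, ocr_feats):
        # dec_state: (B, dim)    transformer output at the current decoding step
        # ocr_feats: (B, M, dim) features of the M OCR tokens in the image
        vocab_scores = self.vocab_head(dec_state)            # (B, V)
        q = self.query_proj(dec_state).unsqueeze(1)          # (B, 1, dim)
        k = self.ocr_proj(ocr_feats)                         # (B, M, dim)
        ocr_scores = (q * k).sum(-1)                         # (B, M)
        # Indices < V pick a vocabulary word, indices >= V copy an OCR token.
        return torch.cat([vocab_scores, ocr_scores], dim=-1)  # (B, V + M)
```

At inference, the argmax over the concatenated scores selects either a vocabulary word or an OCR token to copy; its embedding is appended to the decoder input so that multi-word answers such as street names can be produced over several steps.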

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

TLDR
A new language representation model, BERT, is introduced, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers; it can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.

Attention is All you Need

TLDR
A new, simple network architecture, the Transformer, based solely on attention mechanisms and dispensing with recurrence and convolutions entirely, is proposed; it is shown to generalize well to other tasks by applying it successfully to English constituency parsing with both large and limited training data.
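
For reference, the operation underlying the Transformer and all of the multimodal models listed above is scaled dot-product attention; a minimal sketch:

```python
# Minimal sketch of scaled dot-product attention, the building block
# shared by the Transformer and the multimodal models above.
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    # q: (..., Nq, d), k: (..., Nk, d), v: (..., Nk, dv)
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
    if mask is not None:
        scores = scores.masked_fill(~mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v
```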

12-in-1: Multi-Task Vision and Language Representation Learning

TLDR
This work develops a large-scale, multi-task model trained as a single model on 12 datasets from four broad task categories, including visual question answering, caption-based image retrieval, grounding referring expressions, and multimodal verification, and shows that fine-tuning task-specific models from this model can lead to further improvements, achieving performance at or above the state of the art.

Towards VQA Models That Can Read

TLDR
A novel model architecture is introduced that reads text in the image, reasons about it in the context of the image and the question, and predicts an answer that might be a deduction based on the text and the image or might be composed of strings found in the image.

UNITER: UNiversal Image-TExt Representation Learning

TLDR
UNITER, a UNiversal Image-TExt Representation learned through large-scale pre-training over four image-text datasets, is introduced; it can power heterogeneous downstream vision-and-language (V+L) tasks with joint multimodal embeddings.

ImageBERT: Cross-modal Pre-training with Large-scale Weak-supervised Image-Text Data

In this paper, we introduce a new vision-language pre-trained model -- ImageBERT -- for image-text joint embedding. Our model is a Transformer-based model, which takes different modalities as input…

Large-scale Pretraining for Visual Dialog: A Simple State-of-the-Art Baseline

TLDR
This work adapts the recently proposed ViLBERT model for multi-turn visually-grounded conversations and finds that additional finetuning using "dense" annotations in VisDial leads to even higher NDCG but hurts MRR, highlighting a trade-off between the two primary metrics.
...