Spatially Aware Multimodal Transformers for TextVQA

Yash Kant, Dhruv Batra, Peter Anderson, A. Schwing, Devi Parikh, Jiasen Lu, Harsh Agrawal
Textual cues are essential for everyday tasks like buying groceries and using public transport. To develop this assistive technology, we study the TextVQA task, i.e., reasoning about text in images to answer a question. Existing approaches are limited in their use of spatial relations and rely on fully-connected transformer-like architectures to implicitly learn the spatial structure of a scene. In contrast, we propose a novel spatially aware self-attention layer such that each visual entity…

External Knowledge Augmented Text Visual Question Answering
The open-ended question answering task of Text-VQA requires reading and reasoning about local, often previously unseen, scene-text content of an image to generate answers. In this work, we propose…
Localize, Group, and Select: Boosting Text-VQA by Scene Text Modeling
As an important task in multimodal context understanding, Text-VQA (Visual Question Answering) aims at question answering through reading text information in images. It differentiates from the…
TAP: Text-Aware Pre-training for Text-VQA and Text-Caption
The proposed Text-Aware Pre-training (TAP) for Text-VQA and Text-Caption tasks outperforms the state of the art by large margins on multiple tasks, and builds a large-scale dataset based on the Conceptual Caption dataset, named OCR-CC, which contains 1.4 million scene text-related image-text pairs.
A First Look: Towards Explainable TextVQA Models via Visual and Textual Explanations
MTXNet is proposed, an end-to-end trainable multimodal architecture to generate multi-reference textual explanations that are consistent with human interpretations, help justify the models’ decision, and provide useful insights to help diagnose an incorrect prediction.
Detecting Persuasive Atypicality by Modeling Contextual Compatibility
We propose a new approach to detect atypicality in persuasive imagery. Unlike atypicality which has been studied in prior work, persuasive atypicality has a particular purpose to convey meaning, and…
Improved RAMEN: Towards Domain Generalization for Visual Question Answering
Currently nearing human-level performance, Visual Question Answering (VQA) is an emerging area in artificial intelligence. Established as a multi-disciplinary field in machine learning, both computer…
LocalTrans: A Multiscale Local Transformer Network for Cross-Resolution Homography Estimation
Experiments show that the proposed network outperforms existing state-of-the-art feature-based and deep-learning-based homography estimation methods, and is able to accurately align images under 10× resolution gap.
Multi-Domain Few-Shot Learning and Dataset for Agricultural Applications
Automatic classification of pests and plants (both healthy and diseased) is of paramount importance in agriculture to improve yield. Conventional deep learning models based on convolutional neural…
Question-controlled Text-aware Image Captioning
A novel Geometry and Question Aware Model (GQAM), which achieves better captioning performance and question answering ability than carefully designed baselines on both datasets, and generates a personalized text-aware caption with a Multimodal Decoder.
ROSITA: Enhancing Vision-and-Language Semantic Alignments via Cross- and Intra-modal Knowledge Integration
  • Yuhao Cui, Zhou Yu, +4 authors Jun Yu
  • Computer Science
  • ArXiv
  • 2021
Vision-and-language pretraining (VLP) aims to learn generic multimodal representations from massive image-text pairs. While various successful attempts have been proposed, learning fine-grained…


Scene Text Visual Question Answering
A new dataset, ST-VQA, is presented that aims to highlight the importance of exploiting high-level semantic information present in images as textual cues in the Visual Question Answering process, together with a new evaluation metric for these tasks that accounts for both reasoning errors and shortcomings of the text recognition module.
Relation-Aware Graph Attention Network for Visual Question Answering
A Relation-aware Graph Attention Network (ReGAT), which encodes each image into a graph and models multi-type inter-object relations via a graph attention mechanism, to learn question-adaptive relation representations.
Exploring Visual Relationship for Image Captioning
This paper introduces a new design that explores the connections between objects for image captioning within the attention-based encoder-decoder framework, integrating both semantic and spatial object relationships into the image encoder.
Iterative Answer Prediction With Pointer-Augmented Multimodal Transformers for TextVQA
A novel model is proposed based on a multimodal transformer architecture accompanied by a rich representation for text in images that enables iterative answer decoding with a dynamic pointer network, allowing the model to form an answer through multi-step prediction instead of one-step classification.
OCR-VQA: Visual Question Answering by Reading Text in Images
This paper introduces a novel task of visual question answering by reading text in images, i.e., by optical character recognition or OCR, and introduces a large-scale dataset, namely OCRVQA-200K, which comprises 207,572 images of book covers and contains more than 1 million question-answer pairs about these images.
Attention is All you Need
A new simple network architecture, the Transformer, is proposed, based solely on attention mechanisms and dispensing with recurrence and convolutions entirely; it generalizes well to other tasks, as shown by applying it successfully to English constituency parsing with both large and limited training data.
12-in-1: Multi-Task Vision and Language Representation Learning
This work develops a large-scale, multi-task model that culminates in a single model on 12 datasets from four broad categories of task including visual question answering, caption-based image retrieval, grounding referring expressions, and multimodal verification and shows that finetuning task-specific models from this model can lead to further improvements, achieving performance at or above the state-of-the-art.
Towards VQA Models That Can Read
A novel model architecture is introduced that reads text in the image, reasons about it in the context of the image and the question, and predicts an answer which might be a deduction based on the text and the image or composed of the strings found in the images.
ImageBERT: Cross-modal Pre-training with Large-scale Weak-supervised Image-Text Data
In this paper, we introduce a new vision-language pre-trained model -- ImageBERT -- for image-text joint embedding. Our model is a Transformer-based model, which takes different modalities as input…
Large-scale Pretraining for Visual Dialog: A Simple State-of-the-Art Baseline
This work adapts the recently proposed ViLBERT model for multi-turn visually-grounded conversations and finds that additional finetuning using "dense" annotations in VisDial leads to even higher NDCG but hurts MRR, highlighting a trade-off between the two primary metrics.