Spatially Aware Multimodal Transformers for TextVQA

@article{Kant2020SpatiallyAM,
  title={Spatially Aware Multimodal Transformers for TextVQA},
  author={Yash Kant and Dhruv Batra and Peter Anderson and Alexander G. Schwing and Devi Parikh and Jiasen Lu and Harsh Agrawal},
  journal={ArXiv},
  year={2020},
  volume={abs/2007.12146}
}
Textual cues are essential for everyday tasks like buying groceries and using public transport. To develop this assistive technology, we study the TextVQA task, i.e., reasoning about text in images to answer a question. Existing approaches are limited in their use of spatial relations and rely on fully-connected transformer-like architectures to implicitly learn the spatial structure of a scene. In contrast, we propose a novel spatially aware self-attention layer such that each visual entity…
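To make the idea concrete, here is a minimal sketch (assumed class and argument names, not the authors' released code) of a spatially aware self-attention layer: each head receives a boolean mask derived from a spatial relation graph over the detected objects and OCR tokens, so a given head attends only over entity pairs linked by its assigned relation (e.g., "left of" or "overlaps").

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SpatialSelfAttention(nn.Module):
    """Self-attention in which each head only sees spatially related entities."""

    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, 3 * dim)  # joint query/key/value projection
        self.out = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor, rel_mask: torch.Tensor) -> torch.Tensor:
        # x:        (batch, num_entities, dim) object and OCR-token features
        # rel_mask: (batch, num_heads, num_entities, num_entities) booleans;
        #           True where entity j stands in the spatial relation assigned
        #           to that head with respect to entity i (e.g., "left of").
        b, n, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (t.view(b, n, self.num_heads, self.head_dim).transpose(1, 2)
                   for t in (q, k, v))
        scores = q @ k.transpose(-2, -1) / self.head_dim ** 0.5
        # entities outside the head's relation are masked out of the softmax
        scores = scores.masked_fill(~rel_mask, float("-inf"))
        attn = torch.nan_to_num(F.softmax(scores, dim=-1))  # rows with no valid neighbour -> 0
        ctx = (attn @ v).transpose(1, 2).reshape(b, n, -1)
        return self.out(ctx)
```

In practice the per-head masks would be precomputed from bounding-box geometry (one relation type per head, optionally keeping some heads fully connected), and a True diagonal guarantees every entity can at least attend to itself.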

Citations

External Knowledge Augmented Text Visual Question Answering
TLDR: This work designs a framework to extract, validate, and reason with knowledge using a standard multimodal transformer for vision-language understanding tasks, and demonstrates how external knowledge can highlight instance-only cues and thus help deal with training-data bias, improve answer entity-type correctness, and detect multi-word named entities.
Localize, Group, and Select: Boosting Text-VQA by Scene Text Modeling
TLDR: Experiments show that LOGOS outperforms previous state-of-the-art methods on two Text-VQA benchmarks without using additional OCR annotation data, and demonstrate the capability of LOGOS to bridge different modalities and better understand scene text.
TAP: Text-Aware Pre-training for Text-VQA and Text-Caption
TLDR: The proposed Text-Aware Pre-training (TAP) for Text-VQA and Text-Caption tasks outperforms the state of the art by large margins on multiple tasks, and the authors build a large-scale dataset named OCR-CC, based on the Conceptual Captions dataset, which contains 1.4 million scene-text-related image-text pairs.
A First Look: Towards Explainable TextVQA Models via Visual and Textual Explanations
TLDR: MTXNet is proposed, an end-to-end trainable multimodal architecture to generate multi-reference textual explanations that are consistent with human interpretations, help justify the models' decision, and provide useful insights to help diagnose an incorrect prediction.
Detecting Persuasive Atypicality by Modeling Contextual Compatibility
We propose a new approach to detect atypicality in persuasive imagery. Unlike atypicality which has been studied in prior work, persuasive atypicality has a particular purpose to convey meaning, and…
External Knowledge enabled Text Visual Question Answering
TLDR: This work designs a framework to extract, validate, and reason with knowledge using a standard multimodal transformer for vision-language understanding tasks, and demonstrates how external knowledge can highlight instance-only cues and thus help deal with training-data bias, improve answer entity-type correctness, and detect multi-word named entities.
Improved RAMEN: Towards Domain Generalization for Visual Question Answering
TLDR: This study provides two major improvements to the early/late fusion module and aggregation module of the RAMEN architecture, with the objective of further strengthening domain generalization.
LocalTrans: A Multiscale Local Transformer Network for Cross-Resolution Homography Estimation
TLDR: Experiments show that the proposed network outperforms existing state-of-the-art feature-based and deep-learning-based homography estimation methods, and is able to accurately align images under a 10× resolution gap.
Multi-Domain Few-Shot Learning and Dataset for Agricultural Applications
TLDR: This work proposes a Few-Shot Learning (FSL) method that learns from a few samples to automatically classify different pests, plants, and their diseases, and shows that it outperforms several existing FSL architectures in agriculture.
Question-controlled Text-aware Image Captioning
TLDR: A novel Geometry and Question Aware Model (GQAM) is proposed that generates a personalized text-aware caption with a Multimodal Decoder and achieves better captioning performance and question-answering ability than carefully designed baselines on both datasets.

References

Showing 1-10 of 59 references
Scene Text Visual Question Answering
TLDR: A new dataset, ST-VQA, is presented that aims to highlight the importance of exploiting the high-level semantic information present in images as textual cues in the Visual Question Answering process, and a new evaluation metric is proposed for these tasks to account both for reasoning errors and for shortcomings of the text recognition module.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
TLDR: A new language representation model, BERT, is introduced, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers; it can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
OCR-VQA: Visual Question Answering by Reading Text in Images
TLDR: This paper introduces a novel task of visual question answering by reading text in images, i.e., by optical character recognition or OCR, along with a large-scale dataset, OCR-VQA-200K, which comprises 207,572 images of book covers and more than 1 million question-answer pairs about these images.
Attention is All you Need
TLDR: A new, simple network architecture, the Transformer, based solely on attention mechanisms and dispensing with recurrence and convolutions entirely, is proposed; it generalizes well to other tasks, having been applied successfully to English constituency parsing with both large and limited training data.
Towards VQA Models That Can Read
TLDR: A novel model architecture is introduced that reads text in the image, reasons about it in the context of the image and the question, and predicts an answer which might be a deduction based on the text and the image or composed of the strings found in the images.
12-in-1: Multi-Task Vision and Language Representation Learning
TLDR: This work develops a large-scale, multi-task model trained on 12 datasets from four broad categories of tasks, including visual question answering, caption-based image retrieval, grounding referring expressions, and multimodal verification, and shows that fine-tuning task-specific models from this single model can lead to further improvements, achieving performance at or above the state of the art.
ImageBERT: Cross-modal Pre-training with Large-scale Weak-supervised Image-Text Data
In this paper, we introduce a new vision-language pre-trained model -- ImageBERT -- for image-text joint embedding. Our model is a Transformer-based model, which takes different modalities as input…
Iterative Answer Prediction With Pointer-Augmented Multimodal Transformers for TextVQA
TLDR: A novel model is proposed, based on a multimodal transformer architecture accompanied by a rich representation for text in images, that enables iterative answer decoding with a dynamic pointer network, allowing the model to form an answer through multi-step prediction instead of one-step classification (a rough sketch of this pointer-augmented decoding step follows the reference list below).
Large-scale Pretraining for Visual Dialog: A Simple State-of-the-Art Baseline
TLDR: This work adapts the recently proposed ViLBERT model for multi-turn visually-grounded conversations and finds that additional finetuning using "dense" annotations in VisDial leads to even higher NDCG but hurts MRR, highlighting a trade-off between the two primary metrics.
The Open Images Dataset V4
TLDR: In-depth, comprehensive statistics about the dataset are provided, the quality of the annotations is validated, the way the performance of several modern models evolves with increasing amounts of training data is studied, and two applications made possible by having unified annotations of multiple types coexisting in the same images are demonstrated.
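As a companion to the "Iterative Answer Prediction" entry above, the following is a rough, hypothetical sketch (assumed names; not the released M4C code) of one step of pointer-augmented answer decoding: the decoder state is scored against a fixed answer vocabulary and, via a dynamic pointer, against the detected OCR-token features, so each decoding step can either emit a vocabulary word or copy a scene-text token.

```python
import torch
import torch.nn as nn


class PointerAugmentedStep(nn.Module):
    """One decoding step: score a fixed vocabulary and copyable OCR tokens."""

    def __init__(self, dim: int, vocab_size: int):
        super().__init__()
        self.vocab_head = nn.Linear(dim, vocab_size)  # scores over the fixed vocabulary
        self.query_proj = nn.Linear(dim, dim)         # dynamic-pointer query
        self.key_proj = nn.Linear(dim, dim)           # dynamic-pointer keys

    def forward(self, dec_state: torch.Tensor, ocr_feats: torch.Tensor) -> torch.Tensor:
        # dec_state: (batch, dim) decoder output at the current step
        # ocr_feats: (batch, num_ocr, dim) features of the detected OCR tokens
        vocab_scores = self.vocab_head(dec_state)            # (batch, vocab_size)
        q = self.query_proj(dec_state).unsqueeze(1)          # (batch, 1, dim)
        ocr_scores = (q * self.key_proj(ocr_feats)).sum(-1)  # (batch, num_ocr)
        # a single argmax over the concatenation picks either a vocabulary word
        # or a copied OCR token; the chosen token is fed back for the next step
        return torch.cat([vocab_scores, ocr_scores], dim=-1)
```

Iterating this step until an end token is produced yields the multi-step answer prediction described in that reference.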