Spatially Aware Multimodal Transformers for TextVQA

@article{Kant2020SpatiallyAM,
  title={Spatially Aware Multimodal Transformers for TextVQA},
  author={Yash Kant and Dhruv Batra and Peter Anderson and Alexander G. Schwing and Devi Parikh and Jiasen Lu and Harsh Agrawal},
  journal={ArXiv},
  year={2020},
  volume={abs/2007.12146}
}
Textual cues are essential for everyday tasks like buying groceries and using public transport. To develop this assistive technology, we study the TextVQA task, i.e., reasoning about text in images to answer a question. Existing approaches are limited in their use of spatial relations and rely on fully-connected transformer-like architectures to implicitly learn the spatial structure of a scene. In contrast, we propose a novel spatially aware self-attention layer such that each visual entity… 
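
The core mechanism summarized above lends itself to a short sketch: instead of letting every visual entity attend to every other one, each self-attention head is restricted to pairs of entities (objects and OCR tokens) linked by a particular spatial relation between their bounding boxes. The snippet below is a minimal illustration of that idea under stated assumptions, not the authors' released implementation; the module name, the round-robin head-to-relation assignment, and the relation-graph format are all assumptions made for the example.

import torch
import torch.nn as nn

class SpatialRelationSelfAttention(nn.Module):
    """Sketch of a spatially aware self-attention layer.

    Each head only attends over entity pairs connected in a spatial-relation
    graph (e.g. "left-of", "above", "overlaps"), rather than using the
    fully-connected attention pattern of a standard transformer layer.
    The relation graph is assumed to be precomputed from bounding boxes.
    """

    def __init__(self, dim: int, num_heads: int, num_relations: int):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.out = nn.Linear(dim, dim)
        # Simple round-robin assignment of relation types to heads
        # (a design choice for this sketch, not prescribed by the text above).
        self.register_buffer("head_relations", torch.arange(num_heads) % num_relations)

    def forward(self, x: torch.Tensor, relation_graph: torch.Tensor) -> torch.Tensor:
        # x:              (batch, num_entities, dim)
        # relation_graph: (batch, num_relations, num_entities, num_entities),
        #                 a binary adjacency matrix per relation type
        b, n, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = q.view(b, n, self.num_heads, self.head_dim).transpose(1, 2)
        k = k.view(b, n, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.view(b, n, self.num_heads, self.head_dim).transpose(1, 2)

        scores = q @ k.transpose(-2, -1) / self.head_dim ** 0.5   # (b, heads, n, n)

        # Per-head mask: entity i may attend to entity j only if they are
        # linked by that head's relation type; self-attention is always kept.
        mask = relation_graph[:, self.head_relations].float()     # (b, heads, n, n)
        mask = mask + torch.eye(n, device=x.device)
        scores = scores.masked_fill(mask == 0, float("-inf"))

        attn = scores.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, n, -1)
        return self.out(out)

A full model would stack several such layers over question, object, and OCR-token embeddings; the relation graph itself (how pairs of boxes are discretized into relation types) is taken as given here.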

Citations

Beyond OCR + VQA: Involving OCR into the Flow for Robust and Accurate TextVQA
TLDR
A visually enhanced text embedding is proposed to enable understanding of texts without accurately recognizing them, and rich contextual information is further leveraged to modify the answer texts even if the OCR module does not recognize them correctly.
External Knowledge Augmented Text Visual Question Answering
TLDR
This work designs a framework to extract, validate, and reason with knowledge using a standard multimodal transformer for vision language understanding tasks, and demonstrates how external knowledge can highlight instance-only cues and thus help deal with training data bias, improve answer entity type correctness, and detect multiword named entities.
Graph Relation Transformer: Incorporating pairwise object features into the Transformer architecture
TLDR
A Graph Relation Transformer (GRT) is proposed that uses edge information in addition to node information for graph attention computation in the Transformer; the GRT is observed to have superior spatial reasoning ability to M4C.
Position-Augmented Transformers with Entity-Aligned Mesh for TextVQA
TLDR
This work proposes a novel model, position-augmented transformers with entity-aligned mesh, for the TextVQA task, and explicitly introduces continuous relative position information of objects and OCR tokens without complex rules.
Structured Multimodal Attentions for TextVQA
  • Chenyu Gao, Qi Zhu, +4 authors Qi Wu
  • Computer Science, Medicine
    IEEE Transactions on Pattern Analysis and Machine Intelligence
  • 2021
TLDR
An end-to-end structured multimodal attention (SMA) neural network is proposed, aimed mainly at the first two of the issues identified in that paper.
Towards Reasoning Ability in Scene Text Visual Question Answering
TLDR
This work designs a gradient-based explainability method to explore why TextVQA models answer what they answer and to find evidence for their predictions, and performs qualitative experiments to visually analyze the models' reasoning ability.
LaTr: Layout-Aware Transformer for Scene-Text VQA
TLDR
This work proposes a novel multimodal architecture for Scene Text Visual Question Answering (STVQA), named Layout-Aware Transformer (LaTr), and demonstrates that LaTr improves robustness to OCR errors and eliminates the need for an external object detector.
Localize, Group, and Select: Boosting Text-VQA by Scene Text Modeling
  • Xiaopeng Lu, Zhen Fan, Yansen Wang, Jean Oh, C. Rosé
  • Computer Science
    2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW)
  • 2021
TLDR
Experiments show that LOGOS outperforms previous state-of-the-art methods on two Text-VQA benchmarks without using additional OCR annotation data, and demonstrate the capability of LOGOS to bridge different modalities and better understand scene text.
TAP: Text-Aware Pre-training for Text-VQA and Text-Caption
TLDR
This paper proposes Text-Aware Pre-training (TAP) for Text-VQA and Text-Caption tasks, and builds OCR-CC, a large-scale scene-text-related image-text dataset based on the Conceptual Captions dataset that contains 1.4 million images with scene text.

References

Showing 1-10 of 59 references
Scene Text Visual Question Answering
TLDR
A new dataset, ST-VQA, is presented that aims to highlight the importance of exploiting the high-level semantic information present in images as textual cues in the Visual Question Answering process, and a new evaluation metric is proposed for these tasks that accounts for both reasoning errors and shortcomings of the text recognition module.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
TLDR
A new language representation model, BERT, is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers; it can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
OCR-VQA: Visual Question Answering by Reading Text in Images
TLDR
This paper introduces a novel task of visual question answering by reading text in images, i.e., by optical character recognition or OCR, and introduces a large-scale dataset, namely OCR-VQA-200K, which comprises 207,572 images of book covers and contains more than 1 million question-answer pairs about these images.
Attention is All you Need
TLDR
A new simple network architecture, the Transformer, based solely on attention mechanisms and dispensing with recurrence and convolutions entirely, is proposed; it generalizes well to other tasks, as demonstrated by applying it successfully to English constituency parsing with both large and limited training data.
Towards VQA Models That Can Read
TLDR
A novel model architecture is introduced that reads text in the image, reasons about it in the context of the image and the question, and predicts an answer which might be a deduction based on the text and the image or composed of the strings found in the images.
12-in-1: Multi-Task Vision and Language Representation Learning
TLDR
This work develops a large-scale multi-task model that culminates in a single model trained on 12 datasets from four broad categories of tasks, including visual question answering, caption-based image retrieval, grounding referring expressions, and multimodal verification, and shows that fine-tuning task-specific models from this model can lead to further improvements, achieving performance at or above the state-of-the-art.
ImageBERT: Cross-modal Pre-training with Large-scale Weak-supervised Image-Text Data
In this paper, we introduce a new vision-language pre-trained model -- ImageBERT -- for image-text joint embedding. Our model is Transformer-based and takes different modalities as input…
Iterative Answer Prediction With Pointer-Augmented Multimodal Transformers for TextVQA
TLDR
A novel model is proposed based on a multimodal transformer architecture accompanied by a rich representation for text in images that enables iterative answer decoding with a dynamic pointer network, allowing the model to form an answer through multi-step prediction instead of one-step classification.
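
The iterative, pointer-augmented decoding summarized in this entry can be sketched in a few lines: at each step, the decoder state scores both a fixed answer vocabulary and the OCR tokens detected in the image, and the selected token is fed back as the next decoder input. The snippet below is a simplified illustration under assumed names and shapes, not the released M4C code.

import torch
import torch.nn as nn

class PointerAugmentedStep(nn.Module):
    # One decoding step that scores a fixed vocabulary together with
    # dynamically pointed-to OCR tokens. Names and shapes are assumptions.
    def __init__(self, dim: int, vocab_size: int):
        super().__init__()
        self.vocab_head = nn.Linear(dim, vocab_size)  # scores over the fixed answer vocabulary
        self.ocr_query = nn.Linear(dim, dim)          # projects the decoder state
        self.ocr_key = nn.Linear(dim, dim)            # projects OCR-token features

    def forward(self, dec_state: torch.Tensor, ocr_feats: torch.Tensor) -> torch.Tensor:
        # dec_state: (batch, dim); ocr_feats: (batch, num_ocr, dim)
        vocab_scores = self.vocab_head(dec_state)                    # (batch, vocab_size)
        q = self.ocr_query(dec_state).unsqueeze(1)                   # (batch, 1, dim)
        pointer_scores = (q * self.ocr_key(ocr_feats)).sum(dim=-1)   # (batch, num_ocr)
        # The next answer token is chosen jointly over vocabulary words and OCR
        # tokens, so copied scene text and vocabulary words can mix in one answer.
        return torch.cat([vocab_scores, pointer_scores], dim=-1)

Embedding the selected token (a vocabulary word or the pointed-to OCR token) and feeding it back as the next decoder input is what makes the answer a multi-step prediction rather than a one-shot classification.
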
Large-scale Pretraining for Visual Dialog: A Simple State-of-the-Art Baseline
TLDR
This work adapts the recently proposed ViLBERT model for multi-turn visually-grounded conversations and finds that additional finetuning using "dense" annotations in VisDial leads to even higher NDCG but hurts MRR, highlighting a trade-off between the two primary metrics.
The Open Images Dataset V4
TLDR
In-depth comprehensive statistics about the dataset are provided, the quality of the annotations is validated, how the performance of several modern models evolves with increasing amounts of training data is analyzed, and two applications made possible by having unified annotations of multiple types coexisting in the same images are demonstrated.