An Improved Attention for Visual Question Answering

@article{Rahman2021AnIA,
  title={An Improved Attention for Visual Question Answering},
  author={Tanzila Rahman and Shih-Han Chou and Leonid Sigal and Giuseppe Carenini},
  journal={2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)},
  year={2021},
  pages={1653-1662}
}
We consider the problem of Visual Question Answering (VQA). Given an image and a free-form, open-ended question expressed in natural language, the goal of a VQA system is to provide an accurate answer to this question with respect to the image. The task is challenging because it requires simultaneous and intricate understanding of both visual and textual information. Attention, which captures intra- and inter-modal dependencies, has emerged as perhaps the most widely used mechanism for addressing these challenges. …
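For orientation, the sketch below shows the generic scaled dot-product attention that such models build on, not this paper's specific architecture: with queries, keys, and values drawn from the same modality it captures intra-modal dependencies, and with queries from one modality and keys/values from the other it captures inter-modal ones. The PyTorch code, feature sizes, and variable names are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    """Generic attention: q from one modality, k/v from the same (intra-)
    or the other (inter-) modality. Shapes: (batch, n_q, d) and (batch, n_kv, d)."""
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5   # (batch, n_q, n_kv)
    weights = F.softmax(scores, dim=-1)           # attention distribution over keys
    return weights @ v                            # attended features, (batch, n_q, d)

# Toy example: 36 region features attended by 14 question-word features.
img = torch.randn(2, 36, 512)   # visual features (e.g., detected regions)
txt = torch.randn(2, 14, 512)   # textual features (e.g., word embeddings)
intra = scaled_dot_product_attention(txt, txt, txt)   # intra-modal (self) attention
inter = scaled_dot_product_attention(txt, img, img)   # inter-modal (guided) attention
print(intra.shape, inter.shape)                       # torch.Size([2, 14, 512]) twice
```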


References

Showing 1–10 of 38 references.
Attention on Attention for Image Captioning
An Attention on Attention (AoA) module is proposed that extends conventional attention mechanisms to determine the relevance between attention results and queries; it is applied to both the encoder and the decoder of the image captioning model, which is named the AoA Network.
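The sketch below illustrates the AoA gating idea described above: the conventional attention result and the query are concatenated, an "information" vector and a sigmoid "attention gate" are computed from them, and their elementwise product is the output. The multi-head attention wrapper, dimensions, and names are placeholder assumptions rather than the authors' exact configuration.

```python
import torch
import torch.nn as nn

class AoA(nn.Module):
    """Attention-on-Attention gate (sketch). Given a query q and the result v_hat
    of an attention step, output sigmoid(W_g [v_hat; q]) * (W_i [v_hat; q])."""
    def __init__(self, dim):
        super().__init__()
        self.info = nn.Linear(2 * dim, dim)   # "information" branch
        self.gate = nn.Linear(2 * dim, dim)   # "attention gate" branch

    def forward(self, q, v_hat):
        x = torch.cat([v_hat, q], dim=-1)
        return torch.sigmoid(self.gate(x)) * self.info(x)

# Usage: wrap a standard attention step with the AoA gate.
dim = 512
attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
aoa = AoA(dim)
q = torch.randn(2, 14, dim)     # queries (e.g., question features)
kv = torch.randn(2, 36, dim)    # keys/values (e.g., image features)
v_hat, _ = attn(q, kv, kv)      # conventional attention result
out = aoa(q, v_hat)             # AoA output, same shape as q
print(out.shape)                # torch.Size([2, 14, 512])
```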
Deep Modular Co-Attention Networks for Visual Question Answering
A deep Modular Co-Attention Network (MCAN) is proposed that consists of modular co-attention layers cascaded in depth; it significantly outperforms previous state-of-the-art models and is quantitatively and qualitatively evaluated on the benchmark VQA-v2 dataset.
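A hedged sketch of the kind of layer MCAN cascades: a self-attention (SA) unit refines features within a modality, and a guided-attention (GA) unit lets image features attend over question features. The Transformer-style residual/LayerNorm wiring, depth, head count, and dimensions below are illustrative assumptions, not the published configuration.

```python
import torch
import torch.nn as nn

class SA(nn.Module):
    """Self-attention unit (sketch): multi-head self-attention + feed-forward,
    each with a residual connection and layer norm."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))
        self.n1, self.n2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x):
        x = self.n1(x + self.attn(x, x, x)[0])
        return self.n2(x + self.ffn(x))

class GA(nn.Module):
    """Guided-attention unit (sketch): image features (queries) attend over
    question features (keys/values), so the question guides visual attention."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))
        self.n1, self.n2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x, y):
        x = self.n1(x + self.attn(x, y, y)[0])
        return self.n2(x + self.ffn(x))

# Cascade in depth: question features pass through SA units; image features
# pass through SA followed by GA conditioned on the refined question.
dim, depth = 512, 2
q_feats, v_feats = torch.randn(2, 14, dim), torch.randn(2, 36, dim)
sa_q = nn.ModuleList([SA(dim) for _ in range(depth)])
sa_v = nn.ModuleList([SA(dim) for _ in range(depth)])
ga_v = nn.ModuleList([GA(dim) for _ in range(depth)])
for i in range(depth):
    q_feats = sa_q[i](q_feats)
    v_feats = ga_v[i](sa_v[i](v_feats), q_feats)
print(q_feats.shape, v_feats.shape)
```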
Differential Networks for Visual Question Answering
This work proposes DN-based Fusion (DF), a novel model for the VQA task that achieves state-of-the-art results on four publicly available datasets and shows the effectiveness of difference operations in the DF model.
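The summary above is terse, so the following is only an illustrative guess at what difference-based fusion can look like, not necessarily the DN/DF formulation: each modality passes through two linear maps whose difference is taken before the modalities are combined, the intuition being that differences of projected features can suppress modality-specific noise. All names and sizes are assumptions.

```python
import torch
import torch.nn as nn

class DifferenceFusion(nn.Module):
    """Illustrative difference-based fusion (sketch, not the paper's exact model)."""
    def __init__(self, v_dim, q_dim, dim=512):
        super().__init__()
        self.v1, self.v2 = nn.Linear(v_dim, dim), nn.Linear(v_dim, dim)
        self.q1, self.q2 = nn.Linear(q_dim, dim), nn.Linear(q_dim, dim)

    def forward(self, v, q):
        dv = self.v1(v) - self.v2(v)              # differential visual features
        dq = self.q1(q) - self.q2(q)              # differential question features
        return torch.relu(dv) * torch.relu(dq)    # fused representation

fusion = DifferenceFusion(v_dim=2048, q_dim=512)
print(fusion(torch.randn(2, 2048), torch.randn(2, 512)).shape)   # torch.Size([2, 512])
```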
Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering
A combined bottom-up and top-down attention mechanism is proposed that enables attention to be calculated at the level of objects and other salient image regions, demonstrating the broad applicability of this approach to VQA.
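The sketch below shows one common way to realize such top-down attention over bottom-up region features: each region feature is scored against a question encoding and the regions are pooled with the resulting softmax weights. The scoring MLP, the use of concatenation, and the feature sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopDownAttention(nn.Module):
    """Question-guided ("top-down") attention over "bottom-up" region features
    (e.g., 36 detected regions per image): score each region against the
    question, then pool regions with the softmax weights."""
    def __init__(self, v_dim, q_dim, hidden=512):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(v_dim + q_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, v, q):
        # v: (batch, n_regions, v_dim), q: (batch, q_dim)
        q_rep = q.unsqueeze(1).expand(-1, v.size(1), -1)    # copy q per region
        logits = self.score(torch.cat([v, q_rep], dim=-1))  # (batch, n_regions, 1)
        alpha = F.softmax(logits, dim=1)                    # attention weights
        return (alpha * v).sum(dim=1)                       # (batch, v_dim)

att = TopDownAttention(v_dim=2048, q_dim=512)
regions = torch.randn(2, 36, 2048)    # bottom-up region features
question = torch.randn(2, 512)        # question encoding (placeholder)
print(att(regions, question).shape)   # torch.Size([2, 2048])
```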
Hierarchical Question-Image Co-Attention for Visual Question Answering
This paper presents a novel co-attention model for VQA that jointly reasons about image and question attention in a hierarchical fashion via novel 1-dimensional convolutional neural networks (CNNs).
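A minimal sketch of the hierarchical question encoding idea: 1-D convolutions with window sizes 1, 2, and 3 yield unigram, bigram, and trigram features at each word position, and a max over the three n-gram types gives phrase-level features. The padding scheme, activation, and dimensions are assumptions.

```python
import torch
import torch.nn as nn

class PhraseFeatures(nn.Module):
    """Phrase-level question features via 1-D convolutions (sketch)."""
    def __init__(self, dim):
        super().__init__()
        self.convs = nn.ModuleList(
            [nn.Conv1d(dim, dim, kernel_size=k, padding=k // 2) for k in (1, 2, 3)])

    def forward(self, words):
        # words: (batch, n_words, dim) word-level embeddings
        x = words.transpose(1, 2)     # (batch, dim, n_words) for Conv1d
        n = words.size(1)
        grams = [torch.tanh(conv(x))[:, :, :n] for conv in self.convs]   # uni/bi/trigram
        return torch.stack(grams, dim=-1).max(dim=-1).values.transpose(1, 2)

words = torch.randn(2, 14, 512)
print(PhraseFeatures(512)(words).shape)   # torch.Size([2, 14, 512])
```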
Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks
This work introduces a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, thus enabling nearly cost-free region proposals, and further merges RPN and Fast R-CNN into a single network by sharing their convolutional features.
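For concreteness, here is a minimal RPN head in the spirit of the description above: a 3x3 convolution slides over the shared feature map, and two sibling 1x1 convolutions predict per-anchor objectness and box-regression deltas. Anchor generation, proposal decoding, NMS, and the Fast R-CNN merge are omitted, and the channel/anchor counts are placeholders.

```python
import torch
import torch.nn as nn

class RPNHead(nn.Module):
    """Region Proposal Network head (sketch): objectness + box deltas per anchor."""
    def __init__(self, in_channels=512, num_anchors=9):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, 512, kernel_size=3, padding=1)
        self.cls = nn.Conv2d(512, num_anchors, kernel_size=1)       # objectness logits
        self.reg = nn.Conv2d(512, num_anchors * 4, kernel_size=1)   # box deltas

    def forward(self, feat):
        x = torch.relu(self.conv(feat))
        return self.cls(x), self.reg(x)

feat = torch.randn(2, 512, 38, 50)    # shared backbone features (placeholder size)
scores, deltas = RPNHead()(feat)
print(scores.shape, deltas.shape)     # (2, 9, 38, 50) and (2, 36, 38, 50)
```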
VQA: Visual Question Answering
We propose the task of free-form and open-ended Visual Question Answering (VQA). Given an image and a natural language question about the image, the task is to provide an accurate natural language answer.
Visual Question Answering using Deep Learning: A Survey and Performance Analysis
This survey covers and discusses recent datasets released in the VQA domain dealing with various question formats and enabling robustness of machine-learning models, and presents and discusses results computed by the authors over vanilla VQA models, the Stacked Attention Network, and the VQA Challenge 2017 winner model.
Cycle-Consistency for Robust Visual Question Answering
A model-agnostic framework is proposed that trains a model not only to answer a question, but also to generate a question conditioned on the answer, such that the answer predicted for the generated question matches the ground-truth answer to the original question.
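The sketch below shows one way such a cycle-consistent objective can be composed: answer the original question, generate a question conditioned on the ground-truth answer, and require the answer predicted for the generated question to match that ground truth. The model interfaces, loss weight, and the omission of the question-generation loss term are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def cycle_consistent_loss(vqa_model, qgen_model, image, question, answer, w_cycle=0.5):
    """Compose the VQA loss with a cycle-consistency term (sketch).
    The question-generation loss itself is omitted here for brevity."""
    ans_logits = vqa_model(image, question)       # 1) answer the original question
    gen_question = qgen_model(image, answer)      # 2) generate a question from the answer
    cyc_logits = vqa_model(image, gen_question)   # 3) answer the generated question
    loss_vqa = F.cross_entropy(ans_logits, answer)
    loss_cycle = F.cross_entropy(cyc_logits, answer)
    return loss_vqa + w_cycle * loss_cycle

# Toy stand-ins just to show the call pattern; real models are neural networks.
num_answers, q_dim = 3000, 512
vqa_model = lambda img, q: torch.randn(img.size(0), num_answers, requires_grad=True)
qgen_model = lambda img, a: torch.randn(img.size(0), q_dim)
image = torch.randn(4, 36, 2048)
question = torch.randn(4, q_dim)
answer = torch.randint(0, num_answers, (4,))
print(cycle_consistent_loss(vqa_model, qgen_model, image, question, answer))
```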
Learning Two-Branch Neural Networks for Image-Text Matching Tasks
This paper investigates two-branch neural networks for learning image-text similarity on two tasks, image-sentence matching and region-phrase matching, and proposes two network structures that produce different output representations.
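A minimal sketch of the two-branch idea: separate branches embed image and text features into a shared space where matching pairs should score higher, here with cosine similarity and a simple bi-directional triplet-style objective. The branch architectures, margin, and dimensions are placeholder assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoBranchEmbedding(nn.Module):
    """Two-branch embedding network (sketch): project image and text features
    into a joint space and compare them with cosine similarity."""
    def __init__(self, img_dim=2048, txt_dim=300, joint_dim=512):
        super().__init__()
        self.img_branch = nn.Sequential(nn.Linear(img_dim, joint_dim), nn.ReLU(),
                                        nn.Linear(joint_dim, joint_dim))
        self.txt_branch = nn.Sequential(nn.Linear(txt_dim, joint_dim), nn.ReLU(),
                                        nn.Linear(joint_dim, joint_dim))

    def forward(self, img, txt):
        x = F.normalize(self.img_branch(img), dim=-1)
        y = F.normalize(self.txt_branch(txt), dim=-1)
        return x @ y.t()    # (n_images, n_sentences) cosine similarities

model = TwoBranchEmbedding()
sims = model(torch.randn(8, 2048), torch.randn(8, 300))
# Bi-directional triplet-style objective over the similarity matrix
# (diagonal entries are the matching pairs; not masked here for brevity).
pos = sims.diag()
loss = (F.relu(0.2 + sims - pos.unsqueeze(1)).mean()
        + F.relu(0.2 + sims - pos.unsqueeze(0)).mean())
print(sims.shape, loss.item())
```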