Visual Question Answering using Deep Learning: A Survey and Performance Analysis

@inproceedings{Srivastava2020VisualQA,
  title={Visual Question Answering using Deep Learning: A Survey and Performance Analysis},
  author={Yash Srivastava and Vaishnav Murali and Shiv Ram Dubey and Snehasis Mukherjee},
  booktitle={CVIP},
  year={2020}
}
The Visual Question Answering (VQA) task combines the challenges of visual and linguistic processing to answer basic "common sense" questions about given images. Given an image and a question in natural language, a VQA system tries to find the correct answer using visual elements of the image and inference gathered from the textual question. In this survey, we cover and discuss the recent datasets released in the VQA domain dealing with various types of question…
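As context for the systems listed below, here is a minimal PyTorch sketch of the canonical VQA pipeline: encode the question with an LSTM, project pooled CNN image features, fuse the two, and classify over a fixed answer vocabulary. All module names, feature sizes, and the element-wise fusion are illustrative assumptions, not the architecture of any particular paper.

import torch
import torch.nn as nn

class SimpleVQA(nn.Module):
    """Minimal VQA baseline: fuse image and question features, classify answer."""
    def __init__(self, vocab_size=10000, embed_dim=300, hidden_dim=512,
                 img_dim=2048, num_answers=1000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.img_proj = nn.Linear(img_dim, hidden_dim)   # project CNN features
        self.classifier = nn.Linear(hidden_dim, num_answers)

    def forward(self, img_feat, question_ids):
        # img_feat: (B, img_dim) pooled CNN features; question_ids: (B, T) token ids
        _, (h, _) = self.lstm(self.embed(question_ids))
        q = h[-1]                                # final LSTM state, (B, hidden_dim)
        v = torch.tanh(self.img_proj(img_feat))  # projected image features
        return self.classifier(q * v)            # element-wise fusion, then classify

model = SimpleVQA()
logits = model(torch.randn(2, 2048), torch.randint(0, 10000, (2, 12)))
print(logits.shape)  # torch.Size([2, 1000])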
VQA With No Questions-Answers Training
  • B. Vatashsky, S. Ullman
  • Computer Science
    2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
  • 2020
TLDR
This approach is able to handle novel domains (extended question types and new object classes, properties, and relations) as long as corresponding visual estimators are available; it can provide explanations for its answers and suggest alternatives when questions are not grounded in the image.
Natural Language Processing based Visual Question Answering Efficient: an EfficientDet Approach
TLDR
The proposed technique uses EfficientDet for image processing and a BiLSTM for question processing; its efficiency comes chiefly from the efficient image-processing stage. A rough sketch of such a two-branch design follows below.
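The sketch below assumes the detector's per-region features are precomputed offline (EfficientDet itself is not invoked); the BiLSTM encoder, mean-pooling over regions, and multiplicative fusion are illustrative assumptions, not the paper's exact model.

import torch
import torch.nn as nn

class DetectorBiLSTMVQA(nn.Module):
    """Sketch: fuse precomputed detector region features with a BiLSTM
    question encoding (the detector itself runs offline)."""
    def __init__(self, vocab_size=10000, embed_dim=300, hidden_dim=256,
                 region_dim=256, num_answers=1000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.bilstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True,
                              bidirectional=True)
        self.region_proj = nn.Linear(region_dim, 2 * hidden_dim)
        self.classifier = nn.Linear(2 * hidden_dim, num_answers)

    def forward(self, region_feats, question_ids):
        # region_feats: (B, R, region_dim) per-region features from a detector
        _, (h, _) = self.bilstm(self.embed(question_ids))
        q = torch.cat([h[-2], h[-1]], dim=-1)       # final fwd + bwd states
        v = self.region_proj(region_feats).mean(1)  # mean-pool over regions
        return self.classifier(q * v)               # fuse and classify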
A Multimodal Memes Classification: A Survey and Open Research Issues
TLDR
This study presents a clear road-map for the Machine Learning (ML) research community to implement and enhance meme classification techniques, and proposes a generalized framework for vision-language (VL) problems.
Natural Language Processing: Challenges and Future Directions
TLDR
This paper provides a short overview of NLP, then dives into the different challenges facing it, and concludes by presenting recent trends and future research directions anticipated by the research community.
COIN: Counterfactual Image Generation for VQA Interpretation
Due to the significant advancement of Natural Language Processing and Computer Vision-based models, Visual Question Answering (VQA) systems are becoming more intelligent and advanced. However, they…
GQA-it: Italian Question Answering on Image Scene Graphs
TLDR
This paper explores the possibility of acquiring, in a semi-automatic fashion, a large-scale dataset for VQA in Italian, with experimental results comparable to those obtained on the original English material.
Linguistic issues behind visual question answering
TLDR
This paper extracts from pioneering computational-linguistics work a list of desiderata that are used to review current computational achievements, and claims that further research is needed to reach a unified approach that jointly encompasses all the underlying linguistic problems.
A Bird’s Eye View of Natural Language Processing and Requirements Engineering
TLDR
It is asserted that human involvement, with knowledge about the domain and the specific project, is still needed in the requirements engineering (RE) process despite progress in the development of NLP systems.
A Comprehensive Review of the Video-to-Text Problem
TLDR
This paper reviews the video-to-text problem, in which the goal is to associate an input video with its textual description, and categorizes and describes the state-of-the-art techniques.
A Survey On Visual Question Answering
Visual question answering (VQA) is a multi-disciplinary task. The main aim of a VQA system is to provide a natural language answer to an open-ended question about a given image. This task involves both…

References

Showing 1–10 of 36 references.
An Analysis of Visual Question Answering Algorithms
TLDR
This paper analyzes existing VQA algorithms using a new dataset called the Task Driven Image Understanding Challenge (TDIUC), which has over 1.6 million questions organized into 12 different categories, and proposes new evaluation schemes that compensate for over-represented question-types and make it easier to study the strengths and weaknesses of algorithms.
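One of the compensations TDIUC proposes can be illustrated with a small, self-contained sketch: averaging accuracy across question types so that over-represented types do not dominate the overall score. This arithmetic-mean version is an illustrative assumption, not the paper's exact evaluation code (TDIUC also discusses normalized and harmonic-mean variants).

from collections import defaultdict

def mean_per_type_accuracy(records):
    """Arithmetic mean of per-question-type accuracies.
    records: iterable of (question_type, correct: bool) pairs."""
    hits, totals = defaultdict(int), defaultdict(int)
    for qtype, correct in records:
        totals[qtype] += 1
        hits[qtype] += int(correct)
    per_type = {t: hits[t] / totals[t] for t in totals}
    return sum(per_type.values()) / len(per_type), per_type

acc, per_type = mean_per_type_accuracy([
    ("color", True), ("color", True), ("color", False),
    ("counting", False), ("counting", True),
])
print(round(acc, 3))  # 0.583: (2/3 + 1/2) / 2, each type weighted equally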
Exploring Models and Data for Image Question Answering
TLDR
This work proposes to use neural networks and visual semantic embeddings, without intermediate stages such as object detection and image segmentation, to predict answers to simple questions about images, and presents a question generation algorithm that converts image descriptions into QA form.
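The "no intermediate stages" idea can be sketched as treating the projected image feature as the first token of the question sequence fed to an LSTM. The sizes and the single-layer LSTM below are illustrative assumptions in the spirit of this approach, not its exact configuration.

import torch
import torch.nn as nn

class ImageAsWordVQA(nn.Module):
    """Sketch: project the CNN feature into word-embedding space and feed it
    to the LSTM as the first token of the question."""
    def __init__(self, vocab_size=10000, embed_dim=300, hidden_dim=512,
                 img_dim=2048, num_answers=1000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.img_to_word = nn.Linear(img_dim, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_answers)

    def forward(self, img_feat, question_ids):
        img_tok = self.img_to_word(img_feat).unsqueeze(1)  # (B, 1, embed_dim)
        seq = torch.cat([img_tok, self.embed(question_ids)], dim=1)
        _, (h, _) = self.lstm(seq)                         # read image, then words
        return self.classifier(h[-1])                      # answer logits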
Visual7W: Grounded Question Answering in Images
TLDR
Object-level grounding creates a semantic link between textual descriptions and image regions, enabling a new type of QA with visual answers in addition to the textual answers used in previous work; a novel LSTM model with spatial attention is proposed to tackle the 7W QA tasks.
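A minimal sketch of spatial attention of this kind is shown below, assuming a flattened convolutional feature grid as input; the additive scoring form and all dimensions are assumptions for illustration rather than the paper's exact model.

import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Sketch: score each spatial location against the question state,
    softmax the scores, and pool the feature map by those weights."""
    def __init__(self, feat_dim=512, q_dim=512, hidden=256):
        super().__init__()
        self.v = nn.Linear(feat_dim, hidden)
        self.q = nn.Linear(q_dim, hidden)
        self.score = nn.Linear(hidden, 1)

    def forward(self, feat_map, q_state):
        # feat_map: (B, N, feat_dim), a flattened H*W grid; q_state: (B, q_dim)
        e = self.score(torch.tanh(self.v(feat_map) + self.q(q_state).unsqueeze(1)))
        alpha = torch.softmax(e, dim=1)       # (B, N, 1) location weights
        return (alpha * feat_map).sum(dim=1)  # (B, feat_dim) attended vector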
Stacked Attention Networks for Image Question Answering
TLDR
A multiple-layer SAN is developed in which an image is queried multiple times to infer the answer progressively, with the SAN locating the relevant visual clues that lead to the answer layer by layer.
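The layer-by-layer querying can be sketched as repeated attention hops that refine the question vector with the attended visual evidence. The two-hop additive formulation below is a simplified assumption in the spirit of SAN, not its exact equations.

import torch
import torch.nn as nn

class StackedAttention(nn.Module):
    """Sketch: each hop attends over image regions and adds the attended
    vector back into the query, refining it progressively."""
    def __init__(self, dim=512, hops=2):
        super().__init__()
        self.hops = nn.ModuleList([
            nn.ModuleDict({"v": nn.Linear(dim, dim),
                           "q": nn.Linear(dim, dim),
                           "score": nn.Linear(dim, 1)})
            for _ in range(hops)])

    def forward(self, regions, query):
        # regions: (B, N, dim) image region features; query: (B, dim)
        for hop in self.hops:
            e = hop["score"](torch.tanh(hop["v"](regions) +
                                        hop["q"](query).unsqueeze(1)))
            alpha = torch.softmax(e, dim=1)               # (B, N, 1)
            query = query + (alpha * regions).sum(dim=1)  # refine the query
        return query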
Differential Networks for Visual Question Answering
TLDR
This work proposes DN-based Fusion (DF), a novel model for the VQA task, which achieves state-of-the-art results on four publicly available datasets and shows the effectiveness of difference operations in the DF model.
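Since the paper centers on difference operations, a hypothetical sketch of difference-based fusion is given below; the specific form (difference of two linear projections per modality, multiplicative fusion) is an assumption for illustration, not the published DF model.

import torch
import torch.nn as nn

class DiffModule(nn.Module):
    """Difference operation: the gap between two learned views of a feature."""
    def __init__(self, dim=512):
        super().__init__()
        self.a = nn.Linear(dim, dim)
        self.b = nn.Linear(dim, dim)

    def forward(self, x):
        return self.a(x) - self.b(x)  # element-wise difference of projections

class DiffFusion(nn.Module):
    """Hypothetical DF-style fusion: difference features per modality,
    multiplicative fusion, linear answer classifier."""
    def __init__(self, dim=512, num_answers=1000):
        super().__init__()
        self.dv = DiffModule(dim)   # visual branch
        self.dq = DiffModule(dim)   # question branch
        self.classifier = nn.Linear(dim, num_answers)

    def forward(self, v, q):
        # v, q: (B, dim) pooled visual and question features
        return self.classifier(self.dv(v) * self.dq(q))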
VQA: Visual Question Answering
We propose the task of free-form and open-ended Visual Question Answering (VQA). Given an image and a natural language question about the image, the task is to provide an accurate natural language…
Tips and Tricks for Visual Question Answering: Learnings from the 2017 Challenge
TLDR
This work presents a massive exploration of the effects of the myriad architectural and hyperparameter choices that must be made in generating a state-of-the-art model and provides a detailed analysis of the impact of each choice on model performance.
Visual Madlibs: Fill in the Blank Description Generation and Question Answering
TLDR
A new dataset of 360,001 focused natural language descriptions for 10,738 images is introduced, and its applicability to two new tasks is demonstrated: focused description generation and multiple-choice question answering for images.
MemexQA: Visual Memex Question Answering
TLDR
Experimental results on the MemexQA dataset demonstrate that MemexNet outperforms strong baselines and yields the state of the art on this novel and challenging task, suggesting MemexNet's efficacy and scalability across various QA tasks.
Differential Attention for Visual Question Answering
TLDR
This paper uses one or more supporting and opposing exemplars to compute a differential attention region that is closer to human attention than other image-based attention methods, which helps improve accuracy when answering questions.
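A toy, hypothetical rendering of the exemplar idea: shift the target image's attention scores toward a supporting exemplar's pattern and away from an opposing one before normalizing. The additive form below is purely illustrative, not the paper's formulation.

import torch

def differential_attention(target, supporting, opposing):
    # target, supporting, opposing: (B, N) unnormalized attention scores
    # over N image regions; nudge the target toward the supporting pattern
    # and away from the opposing pattern, then normalize
    return torch.softmax(target + supporting - opposing, dim=1)

attn = differential_attention(
    torch.randn(2, 49), torch.randn(2, 49), torch.randn(2, 49))
print(attn.sum(dim=1))  # each row sums to 1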