Ask Your Neurons: A Deep Learning Approach to Visual Question Answering

@article{Malinowski2017AskYN,
  title={Ask Your Neurons: A Deep Learning Approach to Visual Question Answering},
  author={Mateusz Malinowski and Marcus Rohrbach and Mario Fritz},
  journal={International Journal of Computer Vision},
  year={2017},
  volume={125},
  pages={110--135}
}
We propose a Deep Learning approach to the visual question answering task, where machines answer questions about real-world images. By combining the latest advances in image representation and natural language processing, we propose Ask Your Neurons, a scalable, jointly trained, end-to-end formulation of this problem. In contrast to previous efforts, we face a multi-modal problem where the language output (answer) is conditioned on visual and natural language inputs (image and question… 
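The core of the formulation above is that the answer distribution is conditioned jointly on the image and the question through one fused representation. A minimal sketch of that idea follows; note this is an illustration only, not the paper's architecture — the fixed random projection and bag-of-words encoder are placeholders for the CNN image encoder and LSTM question encoder, and all dimensions and names (`encode_image`, `encode_question`, `answer_distribution`) are invented for this example.

```python
import math
import random

random.seed(0)

# Toy dimensions, chosen arbitrarily for the sketch.
IMG_DIM, Q_VOCAB, N_ANSWERS = 4, 6, 3

def rand_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

W_img = rand_matrix(IMG_DIM, 8)                 # placeholder for the CNN
W_joint = rand_matrix(N_ANSWERS, IMG_DIM + Q_VOCAB)

def encode_image(pixels):
    # Placeholder "CNN": a fixed random projection with a nonlinearity.
    return [math.tanh(v) for v in matvec(W_img, pixels)]

def encode_question(token_ids):
    # Placeholder "LSTM": a bag-of-words count vector over a toy vocabulary.
    bow = [0.0] * Q_VOCAB
    for t in token_ids:
        bow[t] += 1.0
    return bow

def answer_distribution(pixels, token_ids):
    # Key idea: the answer is conditioned jointly on visual and language
    # inputs via one fused (concatenated) representation.
    fused = encode_image(pixels) + encode_question(token_ids)
    logits = matvec(W_joint, fused)
    exps = [math.exp(z) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

pixels = [random.random() for _ in range(8)]    # fake flattened "image"
question = [2, 5, 1]                            # fake question token ids
print(answer_distribution(pixels, question))    # softmax over candidate answers
```

In the actual system both encoders and the fusion weights are trained jointly end-to-end; here they are frozen random values purely to show the data flow.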
Deep Attention Neural Tensor Network for Visual Question Answering
TLDR
A novel deep attention neural tensor network (DA-NTN) for visual question answering is proposed, which can discover the joint correlations over images, questions and answers with tensor-based representations and integrates into state-of-the-art VQA models.
AnswerNet: Learning to Answer Questions
TLDR
In the proposed model, discriminative features are extracted from both the image and the question, and a hierarchical fusion network is proposed to effectively fuse the image features with the question features.
Tutorial on Answering Questions about Images with Deep Learning
TLDR
This tutorial builds a neural-based approach to answer questions about images that is among the best methods that use a combination of LSTM with a global, full frame CNN representation of an image.
Visual question answering: a state-of-the-art review
TLDR
This review extensively and critically examines the current status of VQA research in terms of step-by-step solution methodologies, datasets and evaluation metrics, and discusses future research directions for all the above-mentioned aspects of VQA separately.
Survey of Visual Question Answering: Datasets and Techniques
TLDR
A survey of the various datasets and models that have been used to tackle visual question answering, classified into four types: non-deep learning models, deep learning models without attention, deep learning models with attention, and other models that do not fit into the first three.
Two-Step Joint Attention Network for Visual Question Answering
TLDR
This work proposes two-step joint attention that uses the combined representation of the image and question features to guide both visual attention and question attention; it demonstrates and analyzes the effectiveness on the VQA dataset, and uses visualization to present the results intuitively.
The VQA-Machine: Learning How to Use Existing Vision Algorithms to Answer New Questions
TLDR
The core of the proposed method is a new co-attention model that learns how to exploit a set of external off-the-shelf algorithms to achieve its goal, an approach that has something in common with the Neural Turing Machine.
Research On Visual Question Answering Based On Deep Stacked Attention Network
TLDR
The results show that the accuracy of predicting answers is 1.13% higher than that of the existing best model, which demonstrates the effectiveness and applicability of the model.
DynGraph: Visual Question Answering via Dynamic Scene Graphs
TLDR
This work proposes a structured approach for VQA that is based on dynamic graphs learned automatically from the input that can be trained end-to-end and does not require additional training labels in the form of predefined graphs or relations.
A Question-Answering framework for plots using Deep learning
TLDR
A deep learning model is described that addresses the reasoning task of question-answering on bar graphs and pie charts by introducing a novel architecture that learns to identify various plot elements, quantify the represented values and determine a relative ordering of these statistical values.
...
...

References

Ask Your Neurons: A Neural-Based Approach to Answering Questions about Images
We address a question answering task on real-world images that is set up as a Visual Turing Test. By combining the latest advances in image representation and natural language processing, we propose… 
Ask, Attend and Answer: Exploring Question-Guided Spatial Attention for Visual Question Answering
TLDR
The Spatial Memory Network, a novel spatial attention architecture that aligns words with image patches in the first hop, is proposed, and improved results are obtained compared to a strong deep baseline model that concatenates image and question features to predict the answer.
Compositional Memory for Visual Question Answering
TLDR
The end-to-end approach explicitly fuses the features associated with the words and those available at multiple local patches in an attention mechanism, and further combines the fused information to generate dynamic messages, which are called episodes.
Neural Module Networks
TLDR
A procedure for constructing and learning neural module networks, which compose collections of jointly-trained neural "modules" into deep networks for question answering, and uses these structures to dynamically instantiate modular networks (with reusable components for recognizing dogs, classifying colors, etc.).
Tutorial on Answering Questions about Images with Deep Learning
TLDR
This tutorial builds a neural-based approach to answer questions about images that is among the best methods that use a combination of LSTM with a global, full frame CNN representation of an image.
Visual7W: Grounded Question Answering in Images
TLDR
A semantic link between textual descriptions and image regions by object-level grounding enables a new type of QA with visual answers, in addition to textual answers used in previous work, and proposes a novel LSTM model with spatial attention to tackle the 7W QA tasks.
ABC-CNN: An Attention Based Convolutional Neural Network for Visual Question Answering
TLDR
The proposed ABC-CNN architecture for visual question answering task (VQA) achieves significant improvements over state-of-the-art methods on three benchmark VQA datasets and is shown to reflect the regions that are highly relevant to the questions.
Are You Talking to a Machine? Dataset and Methods for Multilingual Image Question
TLDR
The mQA model, which can answer questions about the content of an image, is presented; it contains four components: a Long Short-Term Memory (LSTM), a Convolutional Neural Network (CNN), an LSTM for storing the linguistic context in an answer, and a fusing component that combines the information from the first three components to generate the answer.
Learning to Answer Questions from Image Using Convolutional Neural Network
TLDR
The proposed CNN provides an end-to-end framework with convolutional architectures for learning not only the image and question representations, but also their inter-modal interactions to produce the answer.
Ask Me Anything: Free-Form Visual Question Answering Based on Knowledge from External Sources
TLDR
A method for visual question answering which combines an internal representation of the content of an image with information extracted from a general knowledge base to answer a broad range of image-based questions; it is specifically able to answer questions posed in natural language that refer to information not contained in the image.
...
...