Corpus ID: 23848326

MemexQA: Visual Memex Question Answering

@article{Jiang2017MemexQAVM,
  title={MemexQA: Visual Memex Question Answering},
  author={Lu Jiang and Junwei Liang and Liangliang Cao and Yannis Kalantidis and Sachin Sudhakar Farfade and Alexander Hauptmann},
  journal={ArXiv},
  year={2017},
  volume={abs/1708.01336}
}
This paper proposes a new task, MemexQA: given a collection of photos or videos from a user, the goal is to automatically answer questions that help the user recover their memory about events captured in the collection. Experimental results on the MemexQA dataset demonstrate that MemexNet outperforms strong baselines and yields state-of-the-art results on this novel and challenging task. The promising results on TextQA and VideoQA suggest MemexNet's efficacy and scalability across various QA tasks.

Citations

Focal Visual-Text Attention for Memex Question Answering
TLDR
The MemexQA dataset, the first publicly available multimodal question answering dataset consisting of real personal photo albums, is presented, and an end-to-end trainable network is proposed that uses a hierarchical process to dynamically determine which media and which time steps in the sequential data to focus on when answering a question.
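The "which media and which time" idea can be illustrated with a minimal numpy sketch. The function name, shapes, and dot-product scoring below are hypothetical stand-ins, not the paper's actual FVTA kernel, which learns within-sequence and cross-sequence correlations:

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def focal_attention(seqs, question):
    """seqs: (s, t, d) = s media sequences of t time steps; question: (d,).
    Attend jointly over which sequence and which time step to focus on."""
    scores = seqs @ question                           # (s, t) relevance per cell
    a = softmax(scores.ravel()).reshape(scores.shape)  # joint attention distribution
    return (a[..., None] * seqs).sum(axis=(0, 1))      # (d,) focused summary

s, t, d = 4, 10, 32
rng = np.random.default_rng(0)
print(focal_attention(rng.normal(size=(s, t, d)), rng.normal(size=d)).shape)  # (32,)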
Visual Question Answering using Deep Learning: A Survey and Performance Analysis
TLDR
This survey covers and discusses the recent datasets released in the VQA domain, dealing with various types of question formats and enabling robustness of machine-learning models, and presents and discusses results computed by the authors on vanilla VQA models, the Stacked Attention Network, and the VQA Challenge 2017 winner model.
Focal Visual-Text Attention for Visual Question Answering
TLDR
A novel neural network called the Focal Visual-Text Attention network (FVTA) is described for collective reasoning in visual question answering, where both visual and textual sequence information, such as images and text metadata, are presented.
Video Question-Answering Techniques, Benchmark Datasets and Evaluation Metrics Leveraging Video Captioning: A Comprehensive Survey
TLDR
The presented survey shows that recent works on Memory Networks, Generative Adversarial Networks, and Reinforced Decoders have the capability to handle the complexities and challenges of video QA.
Diverse Visuo-Lingustic Question Answering (DVLQA) Challenge
TLDR
A Diverse Visuo-Lingustic Question Answering (DVLQA) challenge corpus is introduced, where the task is to derive joint inference over a given image-text modality in a question answering setting, and a modular method is developed that demonstrates slightly better baseline performance and offers more transparency for interpreting intermediate outputs.
Semantic Reanalysis of Scene Words in Visual Question Answering
TLDR
A new image-sentence similarity matching model is proposed, which outputs a better image representation by learning semantic concepts and improves accuracy by nearly 10%.
Progressive Attention Memory Network for Movie Story Question Answering
TLDR
Experiments on the publicly available benchmark datasets MovieQA and TVQA demonstrate that each component of the proposed Progressive Attention Memory Network (PAMN) contributes to the movie story QA architecture and improves performance, achieving state-of-the-art results.
Photo Stream Question Answer
TLDR
This paper presents a new visual question answering (VQA) task -- Photo Stream QA, which aims to answer open-ended questions about a narrative photo stream -- and proposes an end-to-end baseline (E-TAA) that provides promising results, outperforming all the other baseline methods.

References

SHOWING 1-10 OF 36 REFERENCES
Dynamic Memory Networks for Visual and Textual Question Answering
TLDR
The new DMN+ model improves the state of the art on both the Visual Question Answering dataset and the bAbI-10k text question-answering dataset without supporting fact supervision.
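To make the episodic idea behind dynamic memory networks concrete, here is a heavily simplified numpy sketch; the real DMN+ scores facts with a multi-feature gating function and updates memory with an attention GRU, both of which are compressed here into dot products and an additive update (all names and shapes are hypothetical):

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def episodic_memory(facts, question, hops=3):
    """facts: (n, d) encoded sentences or regions; question: (d,) query vector.
    Each hop attends over the facts and updates a memory vector."""
    m = question.copy()
    for _ in range(hops):
        g = softmax(facts @ question + facts @ m)   # (n,) attention gates
        e = g @ facts                               # episode summary
        m = np.tanh(m + e)                          # simplified memory update
    return m  # final memory, decoded into an answer in the paper

n, d = 10, 32
rng = np.random.default_rng(7)
print(episodic_memory(rng.normal(size=(n, d)), rng.normal(size=d)).shape)  # (32,)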
Ask Your Neurons: A Neural-Based Approach to Answering Questions about Images
We address a question answering task on real-world images that is set up as a Visual Turing Test. By combining latest advances in image representation and natural language processing, we propose Neural-Image-QA, an end-to-end formulation to this problem for which all parts are trained jointly.
Hierarchical Question-Image Co-Attention for Visual Question Answering
TLDR
This paper presents a novel co-attention model for VQA that jointly reasons about image and question attention in a hierarchical fashion via a novel 1-dimensional convolutional neural network (CNN).
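The hierarchical question representation can be sketched in a few lines of numpy. The windowed-sum convolution and max-pooled question vector below are simplified stand-ins; the paper uses learned 1-D convolution filters for the phrase level and an LSTM for the question level:

import numpy as np

def conv1d_same(x, window):
    """x: (t, d) word embeddings -> (t, d) windowed sums over the previous
    `window` words (a stand-in for a learned 1-D convolution)."""
    t, d = x.shape
    xp = np.concatenate([np.zeros((window - 1, d)), x], axis=0)
    return np.stack([xp[i:i + window].sum(axis=0) for i in range(t)])

words = np.random.default_rng(1).normal(size=(12, 16))   # 12 words, dim 16
# phrase level: elementwise max over unigram/bigram/trigram responses
phrases = np.maximum.reduce([np.tanh(conv1d_same(words, w)) for w in (1, 2, 3)])
question = phrases.max(axis=0)         # question level (an LSTM in the paper)
print(phrases.shape, question.shape)   # (12, 16) (16,)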
Stacked Attention Networks for Image Question Answering
TLDR
A multiple-layer SAN is developed in which an image is queried multiple times to infer the answer progressively, with the SAN locating the relevant visual clues that lead to the answer layer by layer.
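The progressive querying loop admits a compact sketch. This is a minimal numpy rendering under assumed shapes (k regions, dimension d); the weight matrices are random stand-ins for the learned parameters of the stacked attention layers:

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def stacked_attention(regions, question, W_r, W_q, w_p, num_layers=2):
    """regions: (k, d) image-region features; question: (d,) query vector.
    Each layer attends over the regions, then refines the query."""
    u = question
    for _ in range(num_layers):
        h = np.tanh(regions @ W_r.T + u @ W_q.T)   # (k, d) joint features
        p = softmax(h @ w_p)                       # (k,) attention over regions
        u = u + p @ regions                        # refined query for the next layer
    return u  # final query, fed to an answer classifier in the paper

k, d = 49, 64
rng = np.random.default_rng(2)
u = stacked_attention(rng.normal(size=(k, d)), rng.normal(size=d),
                      0.1 * rng.normal(size=(d, d)), 0.1 * rng.normal(size=(d, d)),
                      0.1 * rng.normal(size=d))
print(u.shape)  # (64,)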
Dynamic Coattention Networks For Question Answering
TLDR
The Dynamic Coattention Network (DCN) for question answering first fuses co-dependent representations of the question and the document in order to focus on relevant parts of both; then a dynamic pointing decoder iterates over potential answer spans to recover from initial local maxima corresponding to incorrect answers.
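The coattention encoder (without the iterative decoder) reduces to a few matrix products. The sketch below is a numpy approximation under assumed shapes, omitting the sentinel vectors and the fusion BiLSTM that the paper applies afterwards:

import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def coattention(D, Q):
    """D: (m, d) document states; Q: (n, d) question states.
    Returns question-aware document representations."""
    L = D @ Q.T                # (m, n) affinity between every word pair
    A_q = softmax(L, axis=0)   # attention over document, per question word
    A_d = softmax(L, axis=1)   # attention over question, per document word
    C_q = A_q.T @ D            # (n, d) document summaries per question word
    # attend again, carrying both question states and their document summaries
    C_d = A_d @ np.concatenate([Q, C_q], axis=1)   # (m, 2d) codependent features
    return np.concatenate([D, C_d], axis=1)        # (m, 3d) encoder output

m, n, d = 30, 8, 16
rng = np.random.default_rng(3)
print(coattention(rng.normal(size=(m, d)), rng.normal(size=(n, d))).shape)  # (30, 48)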
Neural Module Networks
TLDR
A procedure for constructing and learning neural module networks, which compose collections of jointly trained neural 'modules' into deep networks for question answering; these structures are used to dynamically instantiate modular networks (with reusable components for recognizing dogs, classifying colors, etc.).
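The composition idea can be shown with trivially stubbed modules. Everything below (the find/describe modules, the dot-product scoring, the concept vector) is a hypothetical stand-in so the layout-assembly pattern runs; the paper's modules are learned networks chosen by a parse of the question:

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def find(image, concept):
    """Attention module: score each image region against a concept embedding."""
    return softmax(image @ concept)     # (k,) attention map

def describe(image, attention):
    """Answer module: map the attended region features to an answer vector."""
    return attention @ image            # (d,)

# "what color is the dog?" parses to the layout describe[color](find[dog](image))
k, d = 49, 8
rng = np.random.default_rng(4)
image = rng.normal(size=(k, d))         # region features
dog = rng.normal(size=d)                # hypothetical concept embedding
print(describe(image, find(image, dog)).shape)   # (8,)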
Gated-Attention Readers for Text Comprehension
TLDR
The Gated-Attention (GA) Reader integrates a multi-hop architecture with a novel attention mechanism based on multiplicative interactions between the query embedding and the intermediate states of a recurrent neural network document reader, enabling the reader to build query-specific representations of tokens in the document for accurate answer selection.
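A minimal numpy sketch of one gated-attention hop under assumed shapes; in the paper a bidirectional GRU re-encodes the document between hops, which is omitted here, and all names are hypothetical:

import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gated_attention_hop(doc, query):
    """doc: (m, d) token states; query: (n, d) query token states.
    Gates each token state by its own query summary (multiplicative interaction)."""
    alpha = softmax(doc @ query.T, axis=1)   # (m, n) attention over query words
    q_tilde = alpha @ query                  # (m, d) token-specific query summaries
    return doc * q_tilde                     # element-wise gating

m, n, d = 40, 6, 32
rng = np.random.default_rng(5)
x = rng.normal(size=(m, d))
for _ in range(3):   # multi-hop; the paper re-encodes with a BiGRU between hops
    x = gated_attention_hop(x, rng.normal(size=(n, d)))
print(x.shape)  # (40, 32)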
SQuAD: 100,000+ Questions for Machine Comprehension of Text
TLDR
A strong logistic regression model is built, which achieves an F1 score of 51.0%, a significant improvement over a simple baseline (20%).
VQA: Visual Question Answering
We propose the task of free-form and open-ended Visual Question Answering (VQA). Given an image and a natural language question about the image, the task is to provide an accurate natural language answer.
Machine Comprehension Using Match-LSTM and Answer Pointer
TLDR
This work proposes an end-to-end neural architecture for the Stanford Question Answering Dataset (SQuAD), based on match-LSTM, a model previously proposed for textual entailment, and Pointer Net, a sequence-to-sequence model proposed by Vinyals et al. (2015), to constrain the output tokens to be from the input sequences.
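The pointer constraint, in its boundary form, is easy to sketch: the decoder scores every passage position as a start and as an end and returns the best-scoring span, so the answer is always a substring of the input. The projection vectors and shapes below are hypothetical:

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def point_span(H, w_start, w_end):
    """H: (m, d) question-aware passage states; returns the argmax (start, end)."""
    p_start = softmax(H @ w_start)   # distribution over start positions
    p_end = softmax(H @ w_end)       # distribution over end positions
    best, span = -1.0, (0, 0)
    for s in range(len(p_start)):    # keep only spans with start <= end
        for e in range(s, len(p_end)):
            if p_start[s] * p_end[e] > best:
                best, span = p_start[s] * p_end[e], (s, e)
    return span

m, d = 25, 16
rng = np.random.default_rng(6)
print(point_span(rng.normal(size=(m, d)), rng.normal(size=d), rng.normal(size=d)))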