TGIF-QA: Toward Spatio-Temporal Reasoning in Visual Question Answering

@article{Jang2017TGIFQATS,
  title={TGIF-QA: Toward Spatio-Temporal Reasoning in Visual Question Answering},
  author={Y. Jang and Yale Song and Youngjae Yu and Youngjin Kim and Gunhee Kim},
  journal={2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2017},
  pages={1359-1367}
}
Vision and language understanding has emerged as a subject undergoing intense study in Artificial Intelligence. Among many tasks in this line of research, visual question answering (VQA) has been one of the most successful ones, where the goal is to learn a model that understands visual content at region-level details and finds their associations with pairs of questions and answers in natural language form. Despite the rapid progress in the past few years, most existing work in VQA has… 
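
A rough, hedged sketch of the spatio-temporal reasoning described above: the snippet attends over region-level features within each frame and then over frames, conditioned on an encoded question. Shapes and parameter names (W_spa, W_tmp) are illustrative assumptions, not the paper's actual architecture.

import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def spatio_temporal_summary(video_feats, q_vec, W_spa, W_tmp):
    # video_feats: (T, R, D) features for R regions in each of T frames
    # q_vec:       (D,) encoded question
    # 1) Spatial attention: pick out question-relevant regions per frame.
    frame_vecs = np.stack([
        softmax(frame @ W_spa @ q_vec) @ frame       # (D,) per-frame summary
        for frame in video_feats
    ])
    # 2) Temporal attention: weight frames by relevance to the question.
    t_weights = softmax(frame_vecs @ W_tmp @ q_vec)  # (T,)
    return t_weights @ frame_vecs                    # (D,) evidence fed to an answer decoder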
Video Question Answering with Spatio-Temporal Reasoning
TLDR
This paper proposes three new tasks designed specifically for video VQA, which require spatio-temporal reasoning from videos to answer questions correctly, and introduces a new large-scale dataset for video VQA named TGIF-QA that extends existing VQA work with its new tasks.
Video Question Answering via Knowledge-based Progressive Spatial-Temporal Attention Network
TLDR
A knowledge-based progressive spatial-temporal attention network is proposed to tackle the problem of video question answering by taking the spatial and temporal dimensions of video content into account and employing an external knowledge base to improve the answering ability of the network.
OK-VQA: A Visual Question Answering Benchmark Requiring External Knowledge
TLDR
This paper addresses the task of knowledge-based visual question answering and provides a benchmark, called OK-VQA, where the image content is not sufficient to answer the questions, encouraging methods that rely on external knowledge resources.
Photo Stream Question Answer
TLDR
This paper presents a new visual question answering (VQA) task -- Photo Stream QA, which aims to answer open-ended questions about a narrative photo stream, and proposes an end-to-end baseline (E-TAA) that provides promising results, outperforming all other baseline methods.
Learnable Aggregating Net with Diversity Learning for Video Question Answering
TLDR
A novel architecture, namely Learnable Aggregating Net with Diversity learning (LAD-Net), is proposed for video VQA, which automatically aggregates adaptively weighted frame-level features to extract rich video (or question) context semantics by imitating Bag-of-Words (BoW) quantization.
TVQA+: Spatio-Temporal Grounding for Video Question Answering
TLDR
By performing this joint task, the proposed Spatio-Temporal Answerer with Grounded Evidence (STAGE) model is able to produce insightful and interpretable spatio-temporal attention visualizations.
Spatio-temporal Relational Reasoning for Video Question Answering
TLDR
This work presents a novel spatio-temporal reasoning neural module which enables modeling complex multi-entity relationships in space and long-term ordered dependencies in time and achieves state-of-the-art performance on two benchmark datasets: TGIF-QA and SVQA.
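
As a loose illustration of pooling relations between entities in space and time, the sketch below aggregates a pairwise function over all entity pairs, in the spirit of a generic relation network rather than the paper's exact module; relation_pool and pair_fn are hypothetical names.

import numpy as np

def relation_pool(entities, pair_fn):
    # entities: (N, D) object/region features gathered across frames
    # pair_fn:  maps two feature vectors to a relation embedding
    #           (a small learned MLP in a trainable model)
    n = entities.shape[0]
    pairs = [pair_fn(entities[i], entities[j])
             for i in range(n) for j in range(n) if i != j]
    return np.sum(pairs, axis=0)   # permutation-invariant relation summary

# Toy usage with concatenation standing in for the learned pair function.
summary = relation_pool(np.random.rand(4, 8), lambda a, b: np.concatenate([a, b]))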
Multi-Question Learning for Visual Question Answering
TLDR
An effective VQA framework and a training procedure for Multi-Question Learning (MQL) are proposed, in which a specifically designed attention network models the relation between the input video and its corresponding questions, enabling multiple video-question pairs to be co-trained.
Structured Two-Stream Attention Network for Video Question Answering
TLDR
This paper proposes a Structured Two-stream Attention network, namely STA, to answer a free-form or open-ended natural language question about the content of a given video; it infers rich long-range temporal structures in videos using the authors' structured segment component and encodes text features.
Focal Visual-Text Attention for Visual Question Answering
TLDR
A novel neural network called the Focal Visual-Text Attention network (FVTA) is described for collective reasoning in visual question answering, where both visual and text sequence information, such as images and text metadata, is presented.

References

SHOWING 1-10 OF 49 REFERENCES
Visual7W: Grounded Question Answering in Images
TLDR
A semantic link between textual descriptions and image regions by object-level grounding enables a new type of QA with visual answers, in addition to textual answers used in previous work, and proposes a novel LSTM model with spatial attention to tackle the 7W QA tasks.
Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering
TLDR
This work balances the popular VQA dataset by collecting complementary images such that every question in this balanced dataset is associated with not just a single image, but rather a pair of similar images that result in two different answers to the question.
Ask Your Neurons: A Neural-Based Approach to Answering Questions about Images
We address a question answering task on real-world images that is set up as a Visual Turing Test. By combining latest advances in image representation and natural language processing, we propose…
What Value Do Explicit High Level Concepts Have in Vision to Language Problems?
TLDR
A method of incorporating high-level concepts into the successful CNN-RNN approach is proposed, and it is shown that it achieves a significant improvement on the state-of-the-art in both image captioning and visual question answering.
VQA: Visual Question Answering
We propose the task of free-form and open-ended Visual Question Answering (VQA). Given an image and a natural language question about the image, the task is to provide an accurate natural language…
Exploring Models and Data for Image Question Answering
TLDR
This work proposes to use neural networks and visual semantic embeddings, without intermediate stages such as object detection and image segmentation, to predict answers to simple questions about images, and presents a question generation algorithm that converts image descriptions into QA form.
Visual Dialog
TLDR
A retrieval-based evaluation protocol for Visual Dialog where the AI agent is asked to sort a set of candidate answers and evaluated on metrics such as mean-reciprocal-rank of human response, and a family of neural encoder-decoder models, which outperform a number of sophisticated baselines.
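
To make the retrieval-based protocol above concrete, mean reciprocal rank averages the inverse rank of the human response across dialog rounds; this is a generic sketch, not code from the Visual Dialog toolkit.

def mean_reciprocal_rank(human_ranks):
    # human_ranks: 1-indexed position of the human response among the
    # model-sorted candidate answers, one entry per dialog round
    return sum(1.0 / r for r in human_ranks) / len(human_ranks)

print(mean_reciprocal_rank([1, 3, 10]))  # ~0.478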
Visual Madlibs: Fill in the Blank Description Generation and Question Answering
TLDR
A new dataset consisting of 360,001 focused natural language descriptions for 10,738 images is introduced, and its applicability to two new tasks is demonstrated: focused description generation and multiple-choice question answering for images.
Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding
TLDR
This work extensively evaluates Multimodal Compact Bilinear pooling (MCB) on the visual question answering and grounding tasks and consistently shows the benefit of MCB over ablations without MCB.
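
Compact bilinear pooling approximates the outer product of the visual and textual feature vectors by combining count sketches with circular convolution in the frequency domain. The snippet below is a minimal NumPy rendering of that idea under assumed dimensions, not the authors' released implementation.

import numpy as np

def count_sketch(x, h, s, d):
    # Scatter-add the signed entries of x into a d-dimensional sketch.
    sk = np.zeros(d)
    np.add.at(sk, h, s * x)
    return sk

def mcb(x, y, d=1024, seed=0):
    rng = np.random.default_rng(seed)
    hx, sx = rng.integers(0, d, x.size), rng.choice([-1.0, 1.0], x.size)
    hy, sy = rng.integers(0, d, y.size), rng.choice([-1.0, 1.0], y.size)
    # Convolution theorem: multiplying the sketches' spectra avoids
    # forming the full outer product x ⊗ y explicitly.
    fx = np.fft.rfft(count_sketch(x, hx, sx, d))
    fy = np.fft.rfft(count_sketch(y, hy, sy, d))
    return np.fft.irfft(fx * fy, n=d)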
End-to-End Concept Word Detection for Video Captioning, Retrieval, and Question Answering
TLDR
A high-level concept word detector that can be integrated with any video-to-language model is proposed, along with a semantic attention mechanism that selectively focuses on the detected concept words and fuses them with the word encoding and decoding in the language model.