A Dataset and Exploration of Models for Understanding Video Data through Fill-in-the-Blank Question-Answering

@article{Maharaj2017ADA,
  title={A Dataset and Exploration of Models for Understanding Video Data through Fill-in-the-Blank Question-Answering},
  author={Tegan Maharaj and Nicolas Ballas and Anna Rohrbach and Aaron C. Courville and Christopher Joseph Pal},
  journal={2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2017},
  pages={7359-7368}
}
While deep convolutional neural networks frequently approach or exceed human-level performance in benchmark tasks involving static images, extending this success to moving images is not straightforward. Video understanding is of interest for many applications, including content recommendation, prediction, summarization, event/object detection, and understanding human visual perception. However, many domains lack sufficient data to explore and perfect video models. In order to address the need… 
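As a rough illustration of the fill-in-the-blank setup the paper studies, the sketch below scores vocabulary candidates for a blanked caption word from pooled video frame features and an LSTM encoding of the caption. This is a minimal baseline-style sketch, not the authors' exact architecture; all layer names and dimensions are illustrative assumptions.

```python
# Minimal sketch of a video fill-in-the-blank baseline (illustrative, not the
# paper's exact model): pooled frame features and a BiLSTM caption encoding
# are fused to score a vocabulary of candidate answer words.
import torch
import torch.nn as nn

class FillInTheBlankBaseline(nn.Module):
    def __init__(self, vocab_size, word_dim=300, frame_dim=2048, hidden=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, word_dim)
        self.sentence_rnn = nn.LSTM(word_dim, hidden, batch_first=True,
                                    bidirectional=True)
        self.video_proj = nn.Linear(frame_dim, 2 * hidden)
        self.classifier = nn.Linear(2 * hidden, vocab_size)

    def forward(self, frame_feats, blanked_caption):
        # frame_feats: (batch, n_frames, frame_dim) pre-extracted CNN features
        # blanked_caption: (batch, n_words) token ids with a BLANK placeholder
        video = self.video_proj(frame_feats.mean(dim=1))      # temporal mean pool
        _, (h, _) = self.sentence_rnn(self.embed(blanked_caption))
        sentence = torch.cat([h[-2], h[-1]], dim=-1)          # final fwd/bwd states
        return self.classifier(video * sentence)              # logits over vocabulary
```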
Video Question Generation via Cross-Modal Self-Attention Networks Learning
TLDR
This paper introduces a novel task of automatically generating questions from a sequence of video frames and the corresponding subtitles of a video clip, in order to reduce annotation cost.
Video Question Answering with Spatio-Temporal Reasoning
TLDR
This paper proposes three new tasks designed specifically for video QA, which require spatio-temporal reasoning from videos to answer questions correctly, and introduces a new large-scale dataset for video QA named TGIF-QA that extends existing VQA work with these new tasks.
Env-QA: A Video Question Answering Benchmark for Comprehensive Understanding of Dynamic Environments
Visual understanding goes well beyond the study of images or videos on the web. To achieve complex tasks in volatile situations, humans can deeply understand the environment and quickly perceive…
Fill-in-the-blank as a Challenging Video Understanding Evaluation Framework
TLDR
It is shown that both a multimodal model and a strong language model fall well short of human performance, suggesting that the task is more challenging than current video understanding benchmarks.
TVQA: Localized, Compositional Video Question Answering
TLDR
This paper presents TVQA, a large-scale video QA dataset based on 6 popular TV shows, and provides analyses of this new dataset as well as several baselines and a multi-stream end-to-end trainable neural network framework for the TVQA task.
On Modality Bias in the TVQA Dataset
TLDR
This work demonstrates an inherent bias in the dataset towards the textual subtitle modality, and proposes subsets of TVQA that respond exclusively to either or both modalities in order to facilitate multimodal modelling as TVQA originally intended.
Dual Hierarchical Temporal Convolutional Network with QA-Aware Dynamic Normalization for Video Story Question Answering
TLDR
A novel framework named Dual Hierarchical Temporal Convolutional Network (DHTCN) is proposed to address the aforementioned defects together, and it obtains state-of-the-art results on both datasets.
Constructing Hierarchical Q&A Datasets for Video Story Understanding
TLDR
Three criteria for video story understanding are introduced, i.e. memory capacity, logical complexity, and the DIKW (Data-Information-Knowledge-Wisdom) pyramid, and it is discussed how a three-dimensional map constructed from these criteria can be used as a metric for evaluating the levels of intelligence relating to video story understanding.
Dense-Caption Matching and Frame-Selection Gating for Temporal Localization in VideoQA
TLDR
This paper proposes a video question answering model which effectively integrates multi-modal input sources and finds the temporally relevant information to answer questions, and evaluates the model on the challenging TVQA dataset, where each of the model components provides significant gains.

References

SHOWING 1-10 OF 54 REFERENCES
Uncovering Temporal Context for Video Question and Answering
TLDR
An encoder-decoder approach using recurrent neural networks to learn the temporal structure of videos is presented, together with a dual-channel ranking loss for answering multiple-choice questions.
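The sketch below gives one possible reading of a dual-channel margin ranking loss: the correct answer should outscore a distractor under both a video-based and a text-based scoring channel. This is an interpretation for illustration, not necessarily the paper's exact formulation; the function name and margin value are assumptions.

```python
# Illustrative dual-channel margin ranking loss (an interpretation, not
# necessarily the paper's formulation): the correct answer should score
# higher than a distractor under both the video and the text channel.
import torch
import torch.nn.functional as F

def dual_channel_ranking_loss(video_pos, video_neg, text_pos, text_neg, margin=0.2):
    # *_pos / *_neg: (batch,) similarity scores for the correct answer and a
    # distractor, computed in each channel's embedding space.
    target = torch.ones_like(video_pos)
    loss_video = F.margin_ranking_loss(video_pos, video_neg, target, margin=margin)
    loss_text = F.margin_ranking_loss(text_pos, text_neg, target, margin=margin)
    return loss_video + loss_text
```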
Visual Madlibs: Fill in the blank Image Generation and Question Answering
TLDR
A new dataset consisting of 360,001 focused natural language descriptions for 10,738 images is introduced, and its applicability to two new tasks is demonstrated: focused description generation and multiple-choice question answering for images.
Exploring Models and Data for Image Question Answering
TLDR
This work proposes to use neural networks and visual semantic embeddings, without intermediate stages such as object detection and image segmentation, to predict answers to simple questions about images, and presents a question generation algorithm that converts image descriptions into QA form.
Describing Videos by Exploiting Temporal Structure
TLDR
This work proposes an approach that takes into account both the local and global temporal structure of videos to produce descriptions, and introduces a temporal attention mechanism that goes beyond local temporal modeling and learns to automatically select the most relevant temporal segments given the text-generating RNN.
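The sketch below shows soft temporal attention over frame features conditioned on the decoder hidden state, in the spirit of such a mechanism. Layer names and sizes are illustrative assumptions, not the paper's exact parameterization.

```python
# Soft temporal attention over frame features, conditioned on the decoder
# state (illustrative sketch; dimensions and names are assumptions).
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    def __init__(self, frame_dim, hidden_dim, attn_dim=256):
        super().__init__()
        self.frame_proj = nn.Linear(frame_dim, attn_dim)
        self.state_proj = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, frame_feats, decoder_state):
        # frame_feats: (batch, n_frames, frame_dim); decoder_state: (batch, hidden_dim)
        energy = torch.tanh(self.frame_proj(frame_feats)
                            + self.state_proj(decoder_state).unsqueeze(1))
        weights = torch.softmax(self.score(energy).squeeze(-1), dim=1)  # (batch, n_frames)
        context = (weights.unsqueeze(-1) * frame_feats).sum(dim=1)      # weighted sum
        return context, weights
```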
End-to-End Concept Word Detection for Video Captioning, Retrieval, and Question Answering
TLDR
A high-level concept word detector that can be integrated with any video-to-language model is proposed, along with a semantic attention mechanism that selectively focuses on the detected concept words and fuses them with the word encoding and decoding in the language model.
Delving Deeper into Convolutional Networks for Learning Video Representations
TLDR
A variant of the GRU model is introduced that leverages convolution operations to enforce sparse connectivity of the model units and share parameters across input spatial locations, mitigating the effect of low-level percepts; it is evaluated on human action recognition and video captioning tasks.
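A convolutional GRU cell of this kind replaces the dense transforms of a standard GRU with convolutions, so units are locally connected and parameters are shared across spatial locations. The sketch below is a generic ConvGRU cell under that idea; the kernel size and gating layout follow the standard GRU and are assumptions rather than the paper's exact design.

```python
# Generic convolutional GRU cell sketch: the GRU's dense transforms are
# replaced with convolutions (illustrative; not the paper's exact design).
import torch
import torch.nn as nn

class ConvGRUCell(nn.Module):
    def __init__(self, in_channels, hidden_channels, kernel_size=3):
        super().__init__()
        padding = kernel_size // 2
        self.gates = nn.Conv2d(in_channels + hidden_channels,
                               2 * hidden_channels, kernel_size, padding=padding)
        self.candidate = nn.Conv2d(in_channels + hidden_channels,
                                   hidden_channels, kernel_size, padding=padding)

    def forward(self, x, h):
        # x: (batch, in_channels, H, W) frame feature map; h: hidden map, same H, W
        z, r = torch.sigmoid(self.gates(torch.cat([x, h], dim=1))).chunk(2, dim=1)
        h_tilde = torch.tanh(self.candidate(torch.cat([x, r * h], dim=1)))
        return (1 - z) * h + z * h_tilde                 # updated hidden map
```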
Long-term recurrent convolutional networks for visual recognition and description
TLDR
A novel recurrent convolutional architecture suitable for large-scale visual learning that is end-to-end trainable is proposed, and it is shown that such models have distinct advantages over state-of-the-art models for recognition or generation that are separately defined and/or optimized.
Large-Scale Video Classification with Convolutional Neural Networks
TLDR
This work studies multiple approaches for extending the connectivity of a CNN in the time domain to take advantage of local spatio-temporal information, and suggests a multiresolution, foveated architecture as a promising way of speeding up training.
Unsupervised Learning of Video Representations using LSTMs
TLDR
This work uses Long Short-Term Memory networks to learn representations of video sequences and evaluates the representations by fine-tuning them for a supervised learning problem: human action recognition on the UCF-101 and HMDB-51 datasets.
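A minimal sketch of this unsupervised setup is an LSTM sequence autoencoder over pre-extracted frame features: an encoder summarizes the clip, and a decoder conditioned on that summary reconstructs the sequence in reverse order. Dimensions, the teacher-forcing scheme, and the reconstruction target here are simplifying assumptions.

```python
# LSTM autoencoder over frame features (simplified sketch of unsupervised
# video representation learning; dimensions and details are assumptions).
import torch
import torch.nn as nn

class SeqAutoencoder(nn.Module):
    def __init__(self, feat_dim=2048, hidden=1024):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.decoder = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.readout = nn.Linear(hidden, feat_dim)

    def forward(self, frames):
        # frames: (batch, n_frames, feat_dim) pre-extracted frame features
        _, state = self.encoder(frames)                   # clip summary in (h, c)
        target = torch.flip(frames, dims=[1])             # reconstruct in reverse order
        # teacher forcing: decoder sees the previous target frame (zeros at step 0)
        dec_in = torch.cat([torch.zeros_like(target[:, :1]), target[:, :-1]], dim=1)
        dec_out, _ = self.decoder(dec_in, state)
        recon = self.readout(dec_out)
        return recon, target                              # train with MSE(recon, target)
```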
Video Fill in the Blank with Merging LSTMs
TLDR
A new method is presented which takes advantage of the structure of the sentences and employs a merging LSTM (merging two LSTMs) to tackle the problem with embedded textual and visual cues.
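One reading of the merging-LSTM idea is sketched below: one LSTM encodes the words to the left of the blank, another encodes the words to the right (reversed), and the merged states are combined with video features to score candidate words. This is an interpretation for illustration; names and dimensions are assumptions, not the paper's exact model.

```python
# Sketch of a merging-LSTM fill-in-the-blank model (an interpretation, not
# the paper's exact architecture): left/right context LSTMs merged with
# video features to score candidate words for the blank.
import torch
import torch.nn as nn

class MergingLSTMFIB(nn.Module):
    def __init__(self, vocab_size, word_dim=300, video_dim=2048, hidden=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, word_dim)
        self.left_lstm = nn.LSTM(word_dim, hidden, batch_first=True)
        self.right_lstm = nn.LSTM(word_dim, hidden, batch_first=True)
        self.video_proj = nn.Linear(video_dim, hidden)
        self.classifier = nn.Linear(3 * hidden, vocab_size)

    def forward(self, left_ids, right_ids, video_feat):
        # left_ids / right_ids: tokens before / after the blank (right side reversed)
        _, (h_left, _) = self.left_lstm(self.embed(left_ids))
        _, (h_right, _) = self.right_lstm(self.embed(right_ids))
        merged = torch.cat([h_left[-1], h_right[-1],
                            self.video_proj(video_feat)], dim=-1)
        return self.classifier(merged)                    # logits over candidate words
```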