Video Question Answering: Datasets, Algorithms and Challenges

Yaoyao Zhong, Wei Ji, Junbin Xiao, Yicong Li, Wei Deng, Tat-Seng Chua
This survey aims to organize the recent advances in video question answering (VideoQA) and point toward future directions. We first categorize the datasets into 1) normal VideoQA, multi-modal VideoQA, and knowledge-based VideoQA, according to the modalities invoked in the question-answer pairs, and 2) factoid VideoQA and inference VideoQA, according to the technical challenges in comprehending the questions and deriving the correct answers. We then summarize the VideoQA techniques, including…

MRTNet: Multi-Resolution Temporal Network for Video Sentence Grounding

A novel multi-resolution temporal network for video sentence grounding, MRTNet, consisting of a multi-modal feature encoder, a Multi-Resolution Temporal (MRT) module, and a predictor module; the MRT module is hot-pluggable and can be seamlessly incorporated into any anchor-free model.

WildQA: In-the-Wild Video Question Answering

This work proposes WildQA, a video understanding dataset of videos recorded in outdoor settings, and introduces the new task of identifying visual support for a given question and answer (Video Evidence Selection).

Video Graph Transformer for Video Question Answering

It is shown that VGT achieves much better performance than prior methods on VideoQA tasks that challenge dynamic relation reasoning in the pretraining-free scenario, and that it benefits substantially from self-supervised cross-modal pretraining while using orders of magnitude less data.

Progressive Graph Attention Network for Video Question Answering

A novel model, termed Progressive Graph Attention Network (PGAT), which can jointly explore multiple visual relations at the object, frame, and clip levels; experimental results demonstrate that the model significantly outperforms other state-of-the-art models.

MERLOT: Multimodal Neural Script Knowledge Models

This work introduces MERLOT, a model that learns multimodal script knowledge by watching millions of YouTube videos with transcribed speech in an entirely label-free, self-supervised manner, and achieves state-of-the-art performance on 12 different video QA datasets when finetuned.

Just Ask: Learning to Answer Questions from Millions of Narrated Videos

This work proposes to avoid manual annotation by generating a large-scale training dataset for video question answering using automatic cross-modal supervision, and introduces iVQA, a new VideoQA dataset with reduced language biases and high-quality redundant manual annotations.

CLEVRER: CoLlision Events for Video REpresentation and Reasoning

This work introduces CoLlision Events for Video REpresentation and Reasoning (CLEVRER), a diagnostic video dataset for systematic evaluation of computational models on a wide range of reasoning tasks, and evaluates various state-of-the-art visual reasoning models on this benchmark.

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.

Video Question Answering via Gradually Refined Attention over Appearance and Motion

This paper proposes an end-to-end model which gradually refines its attention over the appearance and motion features of the video using the question as guidance and demonstrates the effectiveness of the model by analyzing the refined attention weights during the question answering procedure.

Video Question Answering via Hierarchical Dual-Level Attention Network Learning

This paper develops hierarchical dual-level attention networks to learn question-aware video representations with word-level and question-level attention mechanisms, and devises a question-level fusion attention mechanism for the proposed networks to learn the question-aware joint video representation.

Attention is All you Need

A new simple network architecture, the Transformer, based solely on attention mechanisms and dispensing with recurrence and convolutions entirely, is proposed; it generalizes well to other tasks, as shown by its successful application to English constituency parsing with both large and limited training data.
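The scaled dot-product attention at the core of the Transformer can be sketched in a few lines of NumPy. This is a hedged illustration of the formula softmax(QKᵀ/√d_k)V only, not the paper's full multi-head implementation; the toy shapes and random inputs below are assumptions for demonstration.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Minimal single-head sketch: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # query-key similarities
    scores -= scores.max(axis=-1, keepdims=True)  # shift for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ V                            # weighted sum of values

# Toy example: 2 queries attending over 3 key/value vectors of dimension 4
rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 4))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # each query yields one d_k-dimensional output: (2, 4)
```

Each output row is a convex combination of the value rows, with mixing weights determined by query-key similarity; the √d_k scaling keeps the softmax from saturating as the dimension grows.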

TGIF-QA: Toward Spatio-Temporal Reasoning in Visual Question Answering

This paper proposes three new tasks designed specifically for video VQA, which require spatio-temporal reasoning over videos to answer questions correctly, and introduces a new large-scale video VQA dataset named TGIF-QA that extends existing VQA work with these new tasks.

MovieQA: Understanding Stories in Movies through Question-Answering

The MovieQA dataset, which aims to evaluate automatic story comprehension from both video and text, is introduced and existing QA techniques are extended to show that question-answering with such open-ended semantics is hard.