Video Question Answering with Iterative Video-Text Co-Tokenization

A. J. Piergiovanni, Kairo Morton, Weicheng Kuo, Michael S. Ryoo and Anelia Angelova. European Conference on Computer Vision.

Video question answering is a challenging task that requires jointly understanding the language input, the visual information in individual video frames, and the temporal information about the events occurring in the video. In this paper, we propose a novel multi-stream video encoder for video question answering that uses multiple video inputs and a new video-text iterative co-tokenization approach to answer a variety of questions related to videos. We experimentally evaluate the model…

Video Question Answering: Datasets, Algorithms and Challenges

This survey sorts out the recent advances in video question answering (VideoQA) and points toward future directions, covering methods mainly designed for factoid QA as well as those targeting explicit relation and logic inference.

Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations

Extensive experiments on three benchmark text-video retrieval datasets show that the proposed EMCL learns more discriminative video-and-language representations than previous methods and significantly outperforms previous state-of-the-art methods across all metrics.

Compound Tokens: Channel Fusion for Vision-Language Representation Learning

The effectiveness of compound tokens is demonstrated using an encoder-decoder vision-language model trained end-to-end in the open-vocabulary setting, which achieves highly competitive performance across a range of question answering tasks including GQA, VQA2.0, and SNLI-VE.
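The core idea named in the title, fusing the two modalities along the channel (feature) dimension rather than appending text tokens to the sequence, can be illustrated with a minimal pure-Python sketch. The function name, shapes, and the mean-pooled text summary are illustrative assumptions, not the paper's actual implementation (which aligns modalities with cross-attention).

```python
# Toy sketch of channel-wise fusion of vision and text tokens, in the
# spirit of compound tokens. All names and shapes are illustrative
# assumptions; the paper uses cross-attention for alignment.

def channel_fuse(vision_tokens, text_tokens):
    """Concatenate each vision token with an aligned text summary along
    the channel dimension, keeping the sequence length unchanged."""
    d = len(text_tokens[0])
    # Mean-pool text tokens into one summary vector (a crude stand-in
    # for the learned cross-attention alignment).
    summary = [sum(tok[i] for tok in text_tokens) / len(text_tokens)
               for i in range(d)]
    # Channel concatenation: each fused token has dim d_vision + d_text.
    return [v + summary for v in vision_tokens]

vision = [[1.0, 2.0], [3.0, 4.0]]  # 2 vision tokens, dim 2
text = [[0.5, 0.5], [1.5, 1.5]]    # 2 text tokens, dim 2
fused = channel_fuse(vision, text)  # 2 fused tokens, dim 4
```

The design point is that the token count stays fixed while each token carries both modalities, so downstream attention cost does not grow with the text length.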

Rethinking Video ViTs: Sparse Video Tubes for Joint Image and Video Learning

We present a simple approach which can turn a ViT encoder into an efficient video model, which can seamlessly work with both image and video inputs. By sparsely sampling the inputs, the model is able…
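The sparse sampling mentioned in the snippet can be pictured as taking a few spatio-temporal "tubes" from the video at large strides and flattening each into one token. The following is a minimal pure-Python sketch under assumed toy shapes and strides; the actual method learns tubes of varying shapes via 3D convolutions.

```python
# Hypothetical sketch of sparse tube sampling: a strided grid of small
# spatio-temporal tubes, each flattened into one token. Shapes, strides,
# and the tube size are illustrative assumptions.

def sample_sparse_tubes(video, t_stride, s_stride, tube=(2, 2, 2)):
    """video: nested list [T][H][W] of scalars.
    Returns one flattened token per tube on the strided anchor grid."""
    T, H, W = len(video), len(video[0]), len(video[0][0])
    dt, dh, dw = tube
    tokens = []
    for t0 in range(0, T - dt + 1, t_stride):
        for h0 in range(0, H - dh + 1, s_stride):
            for w0 in range(0, W - dw + 1, s_stride):
                # Flatten the (dt, dh, dw) tube into a single token.
                tok = [video[t0 + t][h0 + h][w0 + w]
                       for t in range(dt)
                       for h in range(dh)
                       for w in range(dw)]
                tokens.append(tok)
    return tokens

# A 4x4x4 toy video; large strides keep the token count small.
video = [[[t * 16 + h * 4 + w for w in range(4)] for h in range(4)]
         for t in range(4)]
tokens = sample_sparse_tubes(video, t_stride=2, s_stride=2)
```

With these strides only 8 tubes become tokens, which is how sparse sampling keeps the sequence short enough for an image-style ViT encoder.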

Hierarchical Relational Attention for Video Question Answering

The proposed VideoQA model attends over temporal segments (i.e., video features) conditioned on each question word to derive the final video representation, which leads to a better reasoning capability.

Motion-Appearance Co-memory Networks for Video Question Answering

The proposed motion-appearance co-memory network is built on concepts from the Dynamic Memory Network (DMN), introduces new mechanisms for video QA, and significantly outperforms the state of the art on all four tasks of TGIF-QA.

Video Question Answering via Gradually Refined Attention over Appearance and Motion

This paper proposes an end-to-end model that gradually refines its attention over the appearance and motion features of the video, using the question as guidance, and demonstrates the model's effectiveness by analyzing the refined attention weights during the question answering procedure.
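The refinement loop described above, attending over two feature streams and updating the question-derived guidance at each step, can be sketched in a few lines of pure Python. The function names, the mixing rule, and the number of steps are illustrative assumptions, not the paper's actual architecture.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attend(query, feats):
    # Dot-product attention: weight each feature vector by its
    # similarity to the query, then return the weighted sum.
    weights = softmax([sum(q * f for q, f in zip(query, feat))
                       for feat in feats])
    d = len(feats[0])
    return [sum(w * feat[i] for w, feat in zip(weights, feats))
            for i in range(d)]

def gradually_refined(question, appearance, motion, steps=2):
    # Hypothetical refinement rule: mix the current guidance with what
    # was just attended from each stream, then attend again.
    ctx = question
    for _ in range(steps):
        a = attend(ctx, appearance)  # attend over appearance features
        m = attend(ctx, motion)      # attend over motion features
        ctx = [(c + ai + mi) / 3 for c, ai, mi in zip(ctx, a, m)]
    return ctx

app = [[1.0, 0.0], [0.0, 1.0]]
mot = [[1.0, 0.0], [0.0, 1.0]]
refined = gradually_refined([1.0, 0.0], app, mot)
```

Each pass sharpens the guidance vector toward the features the question selected in the previous pass, which is the "gradual refinement" the abstract refers to.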

ActivityNet-QA: A Dataset for Understanding Complex Web Videos via Question Answering

This work introduces ActivityNet-QA, a fully annotated and large-scale VideoQA dataset consisting of 58,000 QA pairs on 5,800 complex web videos derived from the popular ActivityNet dataset, and explores various video representation strategies to improve VideoQA performance.

KnowIT VQA: Answering Knowledge-Based Questions about Videos

This work introduces KnowIT VQA, a video dataset with 24,282 human-generated question-answer pairs about a popular sitcom, and proposes a video understanding model by combining the visual and textual video content with specific knowledge about the show.

DeepStory: Video Story QA by Deep Embedded Memory Networks

A video-story learning model, Deep Embedded Memory Networks (DEMN), reconstructs stories from a joint scene-dialogue video stream using a latent embedding space of observed data, outperforming other QA models.

VALUE: A Multi-Task Benchmark for Video-and-Language Understanding Evaluation

The Video-And-Language Understanding Evaluation (VALUE) benchmark is introduced, an assemblage of 11 VidL datasets covering 3 popular tasks: (i) text-to-video retrieval; (ii) video question answering; and (iii) video captioning. The benchmark promotes models that leverage information from both video frames and their associated subtitles, as well as models that share knowledge across multiple tasks.

Just Ask: Learning to Answer Questions from Millions of Narrated Videos

This work proposes to avoid manual annotation and to generate a large-scale training dataset for video question answering using automatic cross-modal supervision, and introduces iVQA, a new VideoQA dataset with reduced language biases and high-quality redundant manual annotations.

Heterogeneous Memory Enhanced Multimodal Attention Model for Video Question Answering

In this paper, we propose a novel end-to-end trainable Video Question Answering (VideoQA) framework with three major components: 1) a new heterogeneous memory which can effectively learn global…

Divide and Conquer: Question-Guided Spatio-Temporal Contextual Attention for Video Question Answering

Experimental results and comparisons with state-of-the-art methods show that the proposed Question-Guided Spatio-Temporal Contextual Attention Network (QueST) achieves superior performance.