Video Question Answering with Iterative Video-Text Co-Tokenization
@inproceedings{Piergiovanni2022VideoQA,
  title     = {Video Question Answering with Iterative Video-Text Co-Tokenization},
  author    = {A. J. Piergiovanni and Kairo Morton and Weicheng Kuo and Michael S. Ryoo and Anelia Angelova},
  booktitle = {European Conference on Computer Vision},
  year      = {2022}
}
Video question answering is a challenging task that requires jointly understanding the language input, the visual information in individual video frames, and the temporal information about the events occurring in the video. In this paper, we propose a novel multi-stream video encoder for video question answering that uses multiple video inputs and a new video-text iterative co-tokenization approach to answer a variety of questions related to videos. We experimentally evaluate the model…
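A minimal sketch of the iterative co-tokenization idea described above: a small set of learned tokens repeatedly cross-attends to the fused video and text features, so each iteration can re-select the most relevant evidence. All module names, sizes, and the single-stream simplification are assumptions for illustration, not the authors' implementation.

```python
# Illustrative sketch of iterative video-text co-tokenization (assumed
# module names and sizes; not the authors' implementation).
import torch
import torch.nn as nn

class CoTokenizer(nn.Module):
    def __init__(self, dim=256, num_tokens=8, num_iters=3, heads=4):
        super().__init__()
        # A small set of learned tokens summarizes the video-text pair.
        self.tokens = nn.Parameter(torch.randn(num_tokens, dim))
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.num_iters = num_iters

    def forward(self, video_feats, text_feats):
        # video_feats: (B, Tv, dim) from one or more video streams
        # text_feats:  (B, Tt, dim) from the question encoder
        context = torch.cat([video_feats, text_feats], dim=1)
        toks = self.tokens.unsqueeze(0).expand(context.size(0), -1, -1)
        for _ in range(self.num_iters):
            # Tokens cross-attend to the joint video-text context, then are
            # refined; later rounds can re-select different relevant features.
            attended, _ = self.attn(toks, context, context)
            toks = toks + attended
            toks = toks + self.mlp(toks)
        return toks  # compact multimodal tokens for an answer decoder

# Usage with random features standing in for real encoders:
model = CoTokenizer()
out = model(torch.randn(2, 64, 256), torch.randn(2, 12, 256))
print(out.shape)  # torch.Size([2, 8, 256])
```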
4 Citations
Video Question Answering: Datasets, Algorithms and Challenges
- Computer Science · EMNLP
- 2022
This survey reviews recent advances in video question answering (VideoQA), covering methods designed mainly for factoid QA as well as those targeting explicit relation and logic inference, and points toward future directions.
Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations
- Computer Science · ArXiv
- 2022
Extensive experiments on three benchmark text-video retrieval datasets show that the proposed EMCL learns more discriminative video-and-language representations than previous methods and significantly outperforms previous state-of-the-art methods across all metrics.
Compound Tokens: Channel Fusion for Vision-Language Representation Learning
- Computer Science · ArXiv
- 2022
The effectiveness of compound tokens is demonstrated with an encoder-decoder vision-language model trained end-to-end in the open-vocabulary setting, which achieves highly competitive performance across a range of question answering tasks including GQA, VQA2.0, and SNLI-VE.
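Since the blurb names channel fusion as the core mechanism, here is a hedged sketch of one plausible form of it: text tokens cross-attend to vision tokens and the result is concatenated along the channel (feature) axis rather than the token axis. Module names and dimensions are illustrative assumptions, not the paper's exact architecture.

```python
# Hedged sketch of channel fusion for vision-language tokens (an
# illustration of the general idea, not the paper's exact model).
import torch
import torch.nn as nn

class ChannelFusion(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Project the doubled channel width back to the model dimension.
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, text_tokens, vision_tokens):
        # text_tokens: (B, Tt, dim); vision_tokens: (B, Tv, dim)
        attended, _ = self.cross_attn(text_tokens, vision_tokens, vision_tokens)
        # Channel-wise concatenation: each compound token carries both
        # modalities in its feature vector, not as separate sequence items.
        compound = torch.cat([text_tokens, attended], dim=-1)  # (B, Tt, 2*dim)
        return self.proj(compound)

fusion = ChannelFusion()
y = fusion(torch.randn(2, 12, 256), torch.randn(2, 64, 256))
print(y.shape)  # torch.Size([2, 12, 256])
```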
Rethinking Video ViTs: Sparse Video Tubes for Joint Image and Video Learning
- Computer Science · ArXiv
- 2022
We present a simple approach that turns a ViT encoder into an efficient video model that works seamlessly with both image and video inputs. By sparsely sampling the inputs, the model is able…
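As a rough illustration of the sparse-tube idea, the sketch below draws a small set of spatio-temporal tube tokens with strided 3D convolutions and concatenates them with sparsely sampled per-frame patch tokens. Kernel shapes and strides are assumptions, not the paper's configuration.

```python
# Rough illustration of sparse tube tokenization: strided 3D convolutions
# sample a few spatio-temporal "tubes" alongside per-frame patches, and the
# concatenated tokens can feed a standard ViT encoder (assumed sizes).
import torch
import torch.nn as nn

class SparseTubeTokenizer(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        # Per-frame patches: the kernel spans a single frame, sampled
        # sparsely in time via the temporal stride.
        self.patch = nn.Conv3d(3, dim, kernel_size=(1, 16, 16), stride=(8, 16, 16))
        # Sparse spatio-temporal tubes: long temporal extent, large spatial
        # strides, so only a handful of tubes are drawn from the video.
        self.tube = nn.Conv3d(3, dim, kernel_size=(8, 8, 8), stride=(8, 32, 32))

    def forward(self, video):
        # video: (B, 3, T, H, W)
        patches = self.patch(video).flatten(2).transpose(1, 2)  # (B, Np, dim)
        tubes = self.tube(video).flatten(2).transpose(1, 2)     # (B, Nt, dim)
        # The combined sequence is far shorter than dense video tokenization.
        return torch.cat([patches, tubes], dim=1)

tok = SparseTubeTokenizer()
print(tok(torch.randn(1, 3, 32, 224, 224)).shape)  # torch.Size([1, 980, 256])
```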
References
Showing 1-10 of 102 references.
Hierarchical Relational Attention for Video Question Answering
- Computer Science · 2018 25th IEEE International Conference on Image Processing (ICIP)
- 2018
The proposed VideoQA model computes attention over temporal segments, i.e. video features, conditioned on each question word, and aggregates the attended features into the final video representation, which leads to better reasoning capability.
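To make that mechanism concrete, here is a generic sketch of word-conditioned temporal attention: each question word scores the video segments, and the per-word attended features are pooled into a final video representation. The bilinear scorer and mean pooling are simplifying assumptions, not the paper's exact hierarchy.

```python
# Generic sketch of question-word-conditioned temporal attention
# (a simplified rendition of the idea, not the paper's exact model).
import torch
import torch.nn as nn
import torch.nn.functional as F

class WordGuidedTemporalAttention(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.score = nn.Bilinear(dim, dim, 1)  # word-segment compatibility

    def forward(self, word_feats, segment_feats):
        # word_feats: (B, W, dim); segment_feats: (B, S, dim)
        B, W, D = word_feats.shape
        S = segment_feats.size(1)
        w = word_feats.unsqueeze(2).expand(B, W, S, D)
        s = segment_feats.unsqueeze(1).expand(B, W, S, D)
        scores = self.score(w.reshape(-1, D), s.reshape(-1, D)).view(B, W, S)
        alpha = F.softmax(scores, dim=-1)   # per-word weights over time
        per_word = torch.einsum('bws,bsd->bwd', alpha, segment_feats)
        return per_word.mean(dim=1)         # pooled video representation

attn = WordGuidedTemporalAttention()
video_repr = attn(torch.randn(2, 10, 256), torch.randn(2, 20, 256))
print(video_repr.shape)  # torch.Size([2, 256])
```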
Motion-Appearance Co-memory Networks for Video Question Answering
- Computer Science · 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition
- 2018
The proposed motion-appearance co-memory network is built on concepts from the Dynamic Memory Network (DMN), introduces new mechanisms for video QA, and significantly outperforms the state of the art on all four tasks of TGIF-QA.
Video Question Answering via Gradually Refined Attention over Appearance and Motion
- Computer Science · ACM Multimedia
- 2017
This paper proposes an end-to-end model that gradually refines its attention over the appearance and motion features of the video using the question as guidance, and demonstrates the model's effectiveness by analyzing the refined attention weights during question answering.
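A small sketch of how such gradual refinement could look: a question-derived query attends over appearance and motion features, and the attended summaries update the query for the next round. The GRU-based update and the two rounds are assumptions for illustration, not the authors' model.

```python
# Sketch of question-guided, gradually refined attention over appearance
# and motion streams (an illustrative reading of the mechanism).
import torch
import torch.nn as nn
import torch.nn.functional as F

class RefinedAttention(nn.Module):
    def __init__(self, dim=256, num_rounds=2):
        super().__init__()
        self.num_rounds = num_rounds
        self.update = nn.GRUCell(2 * dim, dim)  # fuses both streams into the query

    @staticmethod
    def attend(query, feats):
        # query: (B, dim); feats: (B, T, dim) -> attention-pooled (B, dim)
        scores = torch.bmm(feats, query.unsqueeze(-1)).squeeze(-1)
        alpha = F.softmax(scores, dim=-1)
        return torch.bmm(alpha.unsqueeze(1), feats).squeeze(1)

    def forward(self, question, appearance, motion):
        q = question  # (B, dim) question summary initializes the query
        for _ in range(self.num_rounds):
            a = self.attend(q, appearance)
            m = self.attend(q, motion)
            # Each round re-reads both streams with a sharper, refined query.
            q = self.update(torch.cat([a, m], dim=-1), q)
        return q

ref = RefinedAttention()
q = ref(torch.randn(2, 256), torch.randn(2, 16, 256), torch.randn(2, 16, 256))
print(q.shape)  # torch.Size([2, 256])
```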
ActivityNet-QA: A Dataset for Understanding Complex Web Videos via Question Answering
- Computer Science, Physics · AAAI
- 2019
This work introduces ActivityNet-QA, a fully annotated and large-scale VideoQA dataset consisting of 58,000 QA pairs on 5,800 complex web videos derived from the popular ActivityNet dataset, and explores various video representation strategies to improve VideoQA performance.
KnowIT VQA: Answering Knowledge-Based Questions about Videos
- Computer Science · AAAI
- 2020
This work introduces KnowIT VQA, a video dataset with 24,282 human-generated question-answer pairs about a popular sitcom, and proposes a video understanding model that combines the visual and textual video content with specific knowledge about the show.
DeepStory: Video Story QA by Deep Embedded Memory Networks
- Computer Science · IJCAI
- 2017
A video-story learning model, Deep Embedded Memory Networks (DEMN), reconstructs stories from a joint scene-dialogue video stream using a latent embedding space of observed data, and outperforms other QA models.
VALUE: A Multi-Task Benchmark for Video-and-Language Understanding Evaluation
- Computer Science · NeurIPS Datasets and Benchmarks
- 2021
The Video-And-Language Understanding Evaluation (VALUE) benchmark is introduced: an assemblage of 11 VidL datasets over 3 popular tasks, (i) text-to-video retrieval, (ii) video question answering, and (iii) video captioning. The benchmark promotes models that leverage information from both video frames and their associated subtitles, as well as models that share knowledge across multiple tasks.
Just Ask: Learning to Answer Questions from Millions of Narrated Videos
- Computer Science · 2021 IEEE/CVF International Conference on Computer Vision (ICCV)
- 2021
This work proposes to avoid manual annotation and generate a large-scale training dataset for video question answering by making use of automatic cross-modal supervision, and introduces iVQA, a new VideoQA dataset with reduced language biases and high-quality redundant manual annotations.
Heterogeneous Memory Enhanced Multimodal Attention Model for Video Question Answering
- Computer Science · 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
- 2019
In this paper, we propose a novel end-to-end trainable Video Question Answering (VideoQA) framework with three major components: 1) a new heterogeneous memory which can effectively learn global…
Divide and Conquer: Question-Guided Spatio-Temporal Contextual Attention for Video Question Answering
- Computer Science · AAAI
- 2020
Experimental results and comparisons with state-of-the-art methods show that the proposed Question-Guided Spatio-Temporal Contextual Attention Network (QueST) achieves superior performance.