ActivityNet-QA: A Dataset for Understanding Complex Web Videos via Question Answering
@article{Yu2019ActivityNetQAAD,
  title   = {ActivityNet-QA: A Dataset for Understanding Complex Web Videos via Question Answering},
  author  = {Zhou Yu and D. Xu and Jun Yu and Ting Yu and Zhou Zhao and Yueting Zhuang and Dacheng Tao},
  journal = {ArXiv},
  year    = {2019},
  volume  = {abs/1906.02467}
}
Recent developments in modeling language and vision have been successfully applied to image question answering. It is both crucial and natural to extend this research direction to the video domain for video question answering (VideoQA). Compared to the image domain, where large-scale and fully annotated benchmark datasets exist, existing VideoQA datasets are either small in scale or automatically generated, which restricts their applicability in practice. Here we introduce…
106 Citations
Learning to Answer Visual Questions from Web Videos
- Computer Science, IEEE Transactions on Pattern Analysis and Machine Intelligence
- 2022
This work proposes to avoid manual annotation by generating a large-scale training dataset for video question answering using automatic cross-modal supervision: a question generation transformer trained on text data produces question-answer pairs from transcribed video narrations.
Multichannel Attention Refinement for Video Question Answering
- Computer Science, ACM Trans. Multim. Comput. Commun. Appl.
- 2020
Appearance, motion, and audio features are extracted from the video, and question-guided attention is refined to generate the expressive clues that support the correct answer in VideoQA.
In-the-Wild Video Question Answering
- Computer Science, COLING
- 2022
This work proposes WILDQA, a video understanding dataset of videos recorded in outdoor settings, and introduces the new task of identifying visual support for a given question and answer (Video Evidence Selection).
LifeQA: A Real-life Dataset for Video Question Answering
- Computer Science, LREC
- 2020
The challenging but realistic aspects of LifeQA are analyzed, and several state-of-the-art video question answering models are applied to provide benchmarks for future research.
WildQA: In-the-Wild Video Question Answering
- Computer Science, ArXiv
- 2022
This work proposes WildQA, a video understanding dataset of videos recorded in outdoor settings, and introduces the new task of identifying visual support for a given question and answer (Video Evidence Selection).
Just Ask: Learning to Answer Questions from Millions of Narrated Videos
- Computer Science, 2021 IEEE/CVF International Conference on Computer Vision (ICCV)
- 2021
This work proposes to avoid manual annotation by generating a large-scale training dataset for video question answering using automatic cross-modal supervision, and introduces iVQA, a new VideoQA dataset with reduced language biases and high-quality redundant manual annotations.
Data augmentation techniques for the Video Question Answering task
- Computer Science, ECCV Workshops
- 2020
This work focuses on the Egocentric VideoQA task, which exploits first-person videos, and proposes several augmentation techniques that yield a +5.5% improvement in final accuracy over the considered baseline.
Video Question Answering: Datasets, Algorithms and Challenges
- Computer Science, EMNLP
- 2022
This survey sorts out recent advances in video question answering (VideoQA) and points towards future directions, covering both methods mainly designed for factoid QA and those targeting explicit relation and logic inference.
NEWSKVQA: Knowledge-Aware News Video Question Answering
- Computer Science, PAKDD
- 2022
A novel approach to video question answering, NEWSKVQA (Knowledge-Aware News Video Question Answering), is proposed that performs multi-modal inference over textual multiple-choice questions, videos, their transcripts, and a knowledge base, and a strong baseline is presented.
Video Question Answering: a Survey of Models and Datasets
- Computer Science, Mobile Networks and Applications
- 2021
A general framework for VideoQA is proposed, covering the core processing model, recurrent neural network (RNN) encoders, and feature fusion, and the ideas and applications of methods such as encoder-decoder, attention, and memory networks are discussed in detail.
References
Showing 1-10 of 42 references
Leveraging Video Descriptions to Learn Video Question Answering
- Computer Science, Physics, AAAI
- 2017
A self-paced learning procedure is proposed to iteratively identify imperfect candidate QA pairs and mitigate their effects during training; it is shown to be effective, and the extended SS model outperforms various baselines.
Video Question Answering via Gradually Refined Attention over Appearance and Motion
- Computer Science, ACM Multimedia
- 2017
This paper proposes an end-to-end model which gradually refines its attention over the appearance and motion features of the video using the question as guidance and demonstrates the effectiveness of the model by analyzing the refined attention weights during the question answering procedure.
TGIF-QA: Toward Spatio-Temporal Reasoning in Visual Question Answering
- Computer Science, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
- 2017
This paper proposes three new tasks designed specifically for video VQA, which require spatio-temporal reasoning from videos to answer questions correctly, and introduces TGIF-QA, a new large-scale dataset for video VQA that extends existing VQA work with these new tasks.
Motion-Appearance Co-memory Networks for Video Question Answering
- Computer Science, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition
- 2018
The proposed motion-appearance co-memory network builds on concepts from the Dynamic Memory Network (DMN), introduces new mechanisms for video QA, and significantly outperforms the state of the art on all four tasks of TGIF-QA.
Open-Ended Long-form Video Question Answering via Adaptive Hierarchical Reinforced Networks
- Computer Science, IJCAI
- 2018
This paper proposes an adaptive hierarchical encoder network that learns a joint representation of long-form video content conditioned on the question, with adaptive video segmentation, and develops a reinforced decoder network to generate natural language answers for open-ended video question answering.
MovieQA: Understanding Stories in Movies through Question-Answering
- Computer Science, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
- 2016
The MovieQA dataset, which aims to evaluate automatic story comprehension from both video and text, is introduced, and existing QA techniques are extended to show that question answering with such open-ended semantics is hard.
Exploring Models and Data for Image Question Answering
- Computer Science, NIPS
- 2015
This work proposes to use neural networks and visual semantic embeddings, without intermediate stages such as object detection and image segmentation, to predict answers to simple questions about images, and presents a question generation algorithm that converts image descriptions into QA form.
DeepStory: Video Story QA by Deep Embedded Memory Networks
- Computer Science, IJCAI
- 2017
A video-story learning model, Deep Embedded Memory Networks (DEMN), is proposed to reconstruct stories from a joint scene-dialogue video stream using a latent embedding space of observed data, and it outperforms other QA models.
Fast Temporal Activity Proposals for Efficient Detection of Human Actions in Untrimmed Videos
- Computer Science, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
- 2016
This paper introduces a proposal method that aims to recover temporal segments containing actions in untrimmed videos, along with a learning framework to represent and retrieve activity proposals.
Describing Videos by Exploiting Temporal Structure
- Computer Science, 2015 IEEE International Conference on Computer Vision (ICCV)
- 2015
This work proposes an approach that successfully takes into account both the local and global temporal structure of videos to produce descriptions, along with a temporal attention mechanism that goes beyond local temporal modeling and learns to automatically select the most relevant temporal segments for the text-generating RNN.