ActivityNet-QA: A Dataset for Understanding Complex Web Videos via Question Answering

@inproceedings{Yu2019ActivityNetQAAD,
  title={ActivityNet-QA: A Dataset for Understanding Complex Web Videos via Question Answering},
  author={Zhou Yu and Dejing Xu and Jun Yu and Ting Yu and Zhou Zhao and Yueting Zhuang and Dacheng Tao},
  booktitle={AAAI},
  year={2019}
}
Recent developments in modeling language and vision have been successfully applied to image question answering. It is both crucial and natural to extend this research direction to the video domain for video question answering (VideoQA). Compared to the image domain, where large-scale, fully annotated benchmark datasets exist, VideoQA datasets are limited in scale or automatically generated, which restricts their applicability in practice. Here we introduce…
Multichannel Attention Refinement for Video Question Answering
TLDR: Appearance, motion, and audio features are extracted from the video, and question-guided attentions are refined to generate expressive clues that support the correct answer in VideoQA.
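As a rough illustration of this family of models, the minimal PyTorch sketch below applies question-guided attention separately to appearance, motion, and audio features and concatenates the attended summaries. All names, dimensions, and the concatenation fusion step are illustrative assumptions, not the architecture of the paper above.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelAttention(nn.Module):
    """Attend over one channel's per-frame features, guided by the question."""
    def __init__(self, feat_dim, q_dim, hidden=256):
        super().__init__()
        self.proj_f = nn.Linear(feat_dim, hidden)
        self.proj_q = nn.Linear(q_dim, hidden)
        self.score = nn.Linear(hidden, 1)

    def forward(self, feats, q):
        # feats: (T, feat_dim) frame features; q: (q_dim,) question embedding
        h = torch.tanh(self.proj_f(feats) + self.proj_q(q))   # (T, hidden)
        alpha = F.softmax(self.score(h).squeeze(-1), dim=0)   # (T,) attention weights
        return alpha @ feats                                  # weighted channel summary

# Toy usage: 20 frames, three channels, a 300-d question embedding (all random).
T, q_dim = 20, 300
q = torch.randn(q_dim)
channels = {"appearance": torch.randn(T, 2048),   # e.g. pooled CNN features
            "motion": torch.randn(T, 1024),       # e.g. 3D-CNN features
            "audio": torch.randn(T, 128)}         # e.g. audio embeddings
attended = [ChannelAttention(f.shape[1], q_dim)(f, q) for f in channels.values()]
fused = torch.cat(attended)   # naive fusion; the paper's refinement is richer
print(fused.shape)            # torch.Size([3200])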
LifeQA: A Real-life Dataset for Video Question Answering
TLDR: The challenging but realistic aspects of LifeQA are analyzed, and several state-of-the-art video question answering models are applied to provide benchmarks for future research.
Video Question Answering: a Survey of Models and Datasets
TLDR: A general framework for VideoQA is proposed, covering the core processing pipeline of recurrent neural network (RNN) encoders and feature fusion, and the ideas and applications of methods such as encoder-decoder, attention, and memory network models are discussed in detail.
Data augmentation techniques for the Video Question Answering task
TLDR: This work focuses on the Egocentric VideoQA task, which exploits first-person videos, and proposes several augmentation techniques that give a +5.5% improvement in final accuracy over the considered baseline.
Two-Stream Spatiotemporal Compositional Attention Network for VideoQA
TLDR: A two-stream spatiotemporal compositional attention network achieves sophisticated multi-step spatiotemporal reasoning by using both motion and detailed appearance features, progressively refining its internal representation and inferring the answer via multiple reasoning steps.
Graph-Based Multi-Interaction Network for Video Question Answering
TLDR: A graph-based relation-aware neural network is proposed to learn a more fine-grained visual representation, capturing the relationships and dependencies between objects spatially and temporally in videos.
Divide and Conquer: Question-Guided Spatio-Temporal Contextual Attention for Video Question Answering
TLDR: Experimental results and comparisons with state-of-the-art methods show that the proposed Question-Guided Spatio-Temporal Contextual Attention Network (QueST) achieves superior performance.
Natural Language Video Localization: A Revisit in Span-based Question Answering Framework
TLDR: This study suggests that the span-based QA framework is an effective strategy for solving the NLVL problem, applying a multi-scale split-and-concatenation strategy to locate the target moment accurately.
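To make the span-based framing concrete, here is a small sketch of the span-selection step, assuming some video-text encoder has already produced per-frame start and end logits (mocked with random tensors here); the encoder and the max_len bound are assumptions, not the paper's method.

import torch

def best_span(start_logits, end_logits, max_len=64):
    # Pick (s, e) maximizing start_logits[s] + end_logits[e] with s <= e < s + max_len,
    # mirroring start/end span prediction in extractive text QA.
    T = start_logits.size(0)
    score = start_logits[:, None] + end_logits[None, :]        # (T, T) pair scores
    valid = torch.triu(torch.ones(T, T, dtype=torch.bool))     # enforce s <= e
    valid &= ~torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=max_len)
    score = score.masked_fill(~valid, float("-inf"))
    idx = int(score.argmax())
    return divmod(idx, T)                                      # (start, end) frame indices

# Stand-in logits over 100 frames; a real model would derive these from video + query.
start_logits, end_logits = torch.randn(100), torch.randn(100)
print(best_span(start_logits, end_logits))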
Compositional Attention Networks With Two-Stream Fusion for Video Question Answering
TLDR: The compositional attention module is the core of CAN and can be seen as a modular combination of a unified attention block; with different fusion strategies, the model achieves new state-of-the-art results on all the datasets.
End-to-End Video Question-Answer Generation with Generator-Pretester Network
TLDR: A novel Generator-Pretester Network focuses on two components: the Joint Question-Answer Generator (JQAG), which generates a question with its corresponding answer to allow Video Question "Answering" training, and the Pretester (PT), which verifies a generated question by trying to answer it and checks the pretested answer against both the model's proposed answer and the ground-truth answer.

References

Showing 1-10 of 57 references
Leveraging Video Descriptions to Learn Video Question Answering
TLDR: A scalable approach to learning video-based question answering (QA), i.e., answering a free-form natural language question about video content, is presented, and a self-paced learning procedure that iteratively identifies non-perfect candidate QA pairs is proposed and shown to be effective.
Video Question Answering via Gradually Refined Attention over Appearance and Motion
TLDR: This paper proposes an end-to-end model that gradually refines its attention over the appearance and motion features of the video, using the question as guidance, and demonstrates the model's effectiveness by analyzing the refined attention weights during the question answering procedure.
TGIF-QA: Toward Spatio-Temporal Reasoning in Visual Question Answering
TLDR: This paper proposes three new tasks designed specifically for video VQA, which require spatio-temporal reasoning from videos to answer questions correctly, and introduces TGIF-QA, a new large-scale dataset for video VQA that extends existing VQA work with these new tasks.
Motion-Appearance Co-memory Networks for Video Question Answering
TLDR: The proposed motion-appearance co-memory network builds on concepts from the Dynamic Memory Network (DMN), introduces new mechanisms for video QA, and significantly outperforms the state of the art on all four tasks of TGIF-QA.
DeepStory: Video Story QA by Deep Embedded Memory Networks
TLDR: A video-story learning model, i.e., Deep Embedded Memory Networks (DEMN), reconstructs stories from a joint scene-dialogue video stream using a latent embedding space of observed data, and outperforms other QA models.
MovieQA: Understanding Stories in Movies through Question-Answering
TLDR: The MovieQA dataset, which aims to evaluate automatic story comprehension from both video and text, is introduced, and existing QA techniques are extended to show that question answering with such open-ended semantics is hard.
Open-Ended Long-form Video Question Answering via Adaptive Hierarchical Reinforced Networks
TLDR: An adaptive hierarchical encoder network is proposed to learn the joint representation of long-form video content according to the question, with adaptive video segmentation, and a reinforced decoder network is developed to generate natural language answers for open-ended video question answering.
Exploring Models and Data for Image Question Answering
TLDR: This work proposes to use neural networks and visual semantic embeddings, without intermediate stages such as object detection and image segmentation, to predict answers to simple questions about images, and presents a question generation algorithm that converts image descriptions into QA form.
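A hedged sketch of that core idea: fuse a pooled CNN image embedding with a question embedding and score a fixed answer vocabulary. The bag-of-words question encoder, the dimensions, and the elementwise fusion are illustrative assumptions rather than the paper's exact model.

import torch
import torch.nn as nn

class ImageQA(nn.Module):
    def __init__(self, img_dim=2048, vocab=5000, emb=300, n_answers=1000):
        super().__init__()
        self.word_emb = nn.EmbeddingBag(vocab, emb)   # bag-of-words question encoder
        self.img_proj = nn.Linear(img_dim, emb)       # map CNN feature into same space
        self.classifier = nn.Linear(emb, n_answers)   # score each candidate answer

    def forward(self, img_feat, question_ids):
        q = self.word_emb(question_ids)               # (B, emb) question embedding
        v = torch.tanh(self.img_proj(img_feat))       # (B, emb) visual embedding
        return self.classifier(q * v)                 # elementwise fusion -> answer logits

# Toy usage with random inputs standing in for real features and token ids.
model = ImageQA()
img_feat = torch.randn(2, 2048)            # e.g. pooled CNN features for 2 images
question = torch.randint(0, 5000, (2, 8))  # two 8-token questions
print(model(img_feat, question).shape)     # torch.Size([2, 1000])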
Fast Temporal Activity Proposals for Efficient Detection of Human Actions in Untrimmed Videos
TLDR: This paper introduces a proposal method that aims to recover temporal segments containing actions in untrimmed videos, together with a learning framework to represent and retrieve activity proposals.
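As a toy illustration of temporal proposal generation (not the learned method from the paper), the sketch below slides multi-scale windows over per-frame "actionness" scores and returns the top-scoring segments; the scoring function is a stand-in assumption.

import numpy as np

def temporal_proposals(actionness, scales=(8, 16, 32), stride=4, top_k=5):
    """actionness: per-frame scores, shape (T,). Returns top_k (start, end, score)."""
    T = len(actionness)
    candidates = []
    for w in scales:                                  # multi-scale window lengths
        for s in range(0, T - w + 1, stride):         # slide each window over time
            seg = actionness[s:s + w]
            candidates.append((s, s + w, float(seg.mean())))
    return sorted(candidates, key=lambda c: -c[2])[:top_k]

scores = np.random.rand(128)   # stand-in for learned per-frame actionness scores
for start, end, score in temporal_proposals(scores):
    print(f"[{start:3d}, {end:3d})  score={score:.3f}")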
Multi-Turn Video Question Answering via Multi-Stream Hierarchical Attention Context Network
TLDR: This paper proposes a hierarchical attention context network for context-aware question understanding by modeling the hierarchically sequential conversation context structure, and develops a multi-stream spatio-temporal attention network for learning the joint representation of the dynamic video contents and the context-aware question embedding.