Hierarchical Conditional Relation Networks for Multimodal Video Question Answering

Thao Minh Le, Vuong Le, Svetha Venkatesh, T. Tran. Int. J. Comput. Vis.
Video QA challenges modelers on multiple fronts. Modeling video requires building not only spatio-temporal models for the dynamic visual channel but also multimodal structures for associated information channels such as subtitles or audio. Video QA adds at least two more layers of complexity: selecting relevant content for each channel in the context of the linguistic query, and composing spatio-temporal concepts and relations in response to the query. To address these requirements, we…


Hierarchical Relational Attention for Video Question Answering
The proposed VideoQA model attends over temporal segments (video features) conditioned on each question word to derive the final video representation, which leads to better reasoning capability.
Heterogeneous Memory Enhanced Multimodal Attention Model for Video Question Answering
In this paper, we propose a novel end-to-end trainable Video Question Answering (VideoQA) framework with three major components, including 1) a new heterogeneous memory that can effectively learn global context from appearance and motion features.
Motion-Appearance Co-memory Networks for Video Question Answering
The proposed motion-appearance co-memory network builds on concepts from the Dynamic Memory Network (DMN), introduces new mechanisms for video QA, and significantly outperforms the state of the art on all four tasks of TGIF-QA.
Multi-interaction Network with Object Relation for Video Question Answering
A new attention mechanism called multi-interaction is proposed, which can simultaneously capture both element-wise and segment-wise sequence interactions and achieves new state-of-the-art performance.
Multimodal Dual Attention Memory for Video Story Question Answering
Ablation studies confirm that the dual attention mechanism combined with late fusion performs best, and MDAM achieves new state-of-the-art results by significant margins over the runner-up models.
TGIF-QA: Toward Spatio-Temporal Reasoning in Visual Question Answering
This paper proposes three new tasks designed specifically for video VQA, which require spatio-temporal reasoning over videos to answer questions correctly, and introduces TGIF-QA, a new large-scale dataset for video VQA that extends existing VQA work with these new tasks.
Movie Question Answering: Remembering the Textual Cues for Layered Visual Contents
A Layered Memory Network (LMN) is put forward that represents frame-level and clip-level movie content via a Static Word Memory module and a Dynamic Subtitle Memory module, achieving state-of-the-art performance on the online 'Video+Subtitles' evaluation task.
Learnable Aggregating Net with Diversity Learning for Video Question Answering
A novel architecture, Learnable Aggregating Net with Diversity learning (LAD-Net), is proposed for V-VQA; it automatically aggregates adaptively weighted frame-level features to extract rich video (or question) context semantics by imitating Bag-of-Words (BoW) quantization.
DeepStory: Video Story QA by Deep Embedded Memory Networks
A video-story learning model, Deep Embedded Memory Networks (DEMN), reconstructs stories from a joint scene-dialogue video stream using a latent embedding space of observed data and outperforms other QA models.
Question-Aware Tube-Switch Network for Video Question Answering
A novel Question-Aware Tube-Switch Network (TSN) for video question answering is proposed, containing a Mix module that synchronously combines appearance and motion representations at the time-slice level and a Switch module that adaptively chooses the appearance or motion tube as primary at each reasoning step, guiding the multi-hop reasoning process.
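Several of the entries above (e.g. Hierarchical Relational Attention, the multi-interaction network) rely on question-guided attention over temporal video features. A minimal NumPy sketch of that common mechanism is shown below; the function name, shapes, and toy data are illustrative assumptions, not any single paper's implementation:

```python
import numpy as np

def question_guided_attention(frame_feats, question_vec):
    """Attend over T frame features (T x D) with a question vector (D,).

    Dot-product scores are softmax-normalized over time, and the
    weighted sum gives a question-conditioned video representation.
    Purely illustrative -- not a specific paper's architecture.
    """
    scores = frame_feats @ question_vec                # (T,)
    scores = scores - scores.max()                     # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()    # softmax over time
    return weights @ frame_feats, weights              # (D,), (T,)

# Toy example: 4 frames with 3-dim features.
frames = np.array([[1., 0., 0.],
                   [0., 1., 0.],
                   [0., 0., 1.],
                   [1., 1., 0.]])
q = np.array([1., 0., 0.])  # hypothetical question embedding
video_rep, w = question_guided_attention(frames, q)
```

In the toy example, frames 0 and 3 align equally well with the question vector, so they receive equal (and maximal) attention weight; the papers above differ mainly in how the scores are computed (word-level, hierarchical, or multi-interaction) rather than in this basic weighted-pooling step.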