Hierarchical Conditional Relation Networks for Multimodal Video Question Answering

@article{Le2021HierarchicalCR,
  title={Hierarchical Conditional Relation Networks for Multimodal Video Question Answering},
  author={Thao Minh Le and Vuong Le and Svetha Venkatesh and T. Tran},
  journal={Int. J. Comput. Vis.},
  year={2021},
  volume={129},
  pages={3027--3050}
}
Video QA challenges modelers on multiple fronts. Modeling video necessitates building not only spatio-temporal models for the dynamic visual channel but also multimodal structures for associated information channels such as subtitles or audio. Video QA adds at least two more layers of complexity: selecting relevant content for each channel in the context of the linguistic query, and composing spatio-temporal concepts and relations in response to the query. To address these requirements, we…
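At the core of this approach is a conditional relation network (CRN) unit that takes an array of input features and a conditioning feature (e.g., the encoded question), relates subsets of the inputs, and modulates the result by the condition; stacking such units hierarchically yields the architecture described above. A minimal NumPy sketch of that pattern (the mean-pooling aggregation and sigmoid gating here are illustrative simplifications standing in for the paper's learned sub-networks, not its exact design):

```python
import itertools
import numpy as np

def crn_unit(inputs, condition, k=2):
    """Relate size-k subsets of `inputs`, conditioned on `condition`.

    inputs:    (n, d) array of clip/frame features
    condition: (d,) conditioning feature (e.g., a question encoding)
    Returns a list of (d,) conditioned relation vectors, one per subset.
    Simplified stand-in: mean-pool each subset, then gate by the condition.
    """
    n, d = inputs.shape
    outputs = []
    for subset in itertools.combinations(range(n), k):
        relation = inputs[list(subset)].mean(axis=0)   # aggregate the subset
        gate = 1.0 / (1.0 + np.exp(-condition))        # sigmoid gate from condition
        outputs.append(gate * relation)                # condition the relation
    return outputs

# Toy usage: 4 clip features of dimension 8, conditioned on a query vector.
feats = np.random.default_rng(0).normal(size=(4, 8))
q = np.zeros(8)
out = crn_unit(feats, q)
print(len(out))  # 6 size-2 subsets from 4 inputs
```

Feeding the outputs of clip-level CRN units into a video-level CRN unit, each time re-conditioned on the query, gives the hierarchical composition the abstract refers to.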

References

Showing 1-10 of 60 references
Hierarchical Relational Attention for Video Question Answering
The proposed VideoQA model attends to temporal segments of the video features based on each question word to derive the final video representation, leading to better reasoning capability.
Heterogeneous Memory Enhanced Multimodal Attention Model for Video Question Answering
In this paper, we propose a novel end-to-end trainable Video Question Answering (VideoQA) framework with three major components: 1) a new heterogeneous memory which can effectively learn global…
Motion-Appearance Co-memory Networks for Video Question Answering
The proposed motion-appearance co-memory network builds on concepts from the Dynamic Memory Network (DMN) and introduces new mechanisms for video QA, significantly outperforming the state of the art on all four tasks of TGIF-QA.
Multi-interaction Network with Object Relation for Video Question Answering
A new attention mechanism called multi-interaction is proposed, which can simultaneously capture both element-wise and segment-wise sequence interactions and achieves new state-of-the-art performance.
Multimodal Dual Attention Memory for Video Story Question Answering
Ablation studies confirm that the dual attention mechanism combined with late fusion performs best, and MDAM achieves new state-of-the-art results with significant margins over the runner-up models.
TGIF-QA: Toward Spatio-Temporal Reasoning in Visual Question Answering
This paper proposes three new tasks designed specifically for video VQA that require spatio-temporal reasoning from videos to answer questions correctly, and introduces a new large-scale video VQA dataset named TGIF-QA that extends existing VQA work with these new tasks.
Movie Question Answering: Remembering the Textual Cues for Layered Visual Contents
A Layered Memory Network (LMN) is put forward that represents frame-level and clip-level movie content via a Static Word Memory module and a Dynamic Subtitle Memory module, achieving state-of-the-art performance on the online 'Video+Subtitles' evaluation task.
Learnable Aggregating Net with Diversity Learning for Video Question Answering
A novel architecture for V-VQA, the Learnable Aggregating Net with Diversity learning (LAD-Net), which automatically aggregates adaptively weighted frame-level features to extract rich video (or question) contextual semantic information by imitating Bag-of-Words (BoW) quantization.
DeepStory: Video Story QA by Deep Embedded Memory Networks
A video-story learning model, Deep Embedded Memory Networks (DEMN), reconstructs stories from a joint scene-dialogue video stream using a latent embedding space of observed data, outperforming other QA models.
Question-Aware Tube-Switch Network for Video Question Answering
A novel Question-Aware Tube-Switch Network (TSN) for video question answering, which contains a Mix module that synchronously combines appearance and motion representations at the time-slice level, and a Switch module that adaptively chooses the appearance or motion tube as primary at each reasoning step, guiding the multi-hop reasoning process.