Multi-Turn Video Question Answering via Multi-Stream Hierarchical Attention Context Network

@inproceedings{Zhao2018MultiTurnVQ,
  title={Multi-Turn Video Question Answering via Multi-Stream Hierarchical Attention Context Network},
  author={Zhou Zhao and Xinghua Jiang and Deng Cai and Jun Xiao and Xiaofei He and Shiliang Pu},
  booktitle={IJCAI},
  year={2018}
}
Conversational video question answering is a challenging task in visual information retrieval: it generates an accurate answer from the referenced video contents according to the visual conversation context and the given question. [...] We first propose the hierarchical attention context network for context-aware question understanding by modeling the hierarchically sequential conversation context structure.
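As a rough illustration of this key method, below is a minimal sketch of a two-level attention context encoder: a word-level GRU summarizes each history utterance under question-guided attention, and an utterance-level GRU summarizes the dialog the same way. The module structure, additive attention form, and shared hidden size are assumptions for illustration, not the authors' exact architecture.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AdditiveAttention(nn.Module):
    """Scores each timestep of `keys` against a single `query` vector."""
    def __init__(self, dim):
        super().__init__()
        self.w = nn.Linear(dim * 2, dim)
        self.v = nn.Linear(dim, 1, bias=False)

    def forward(self, query, keys):                  # query: (B, D), keys: (B, T, D)
        q = query.unsqueeze(1).expand_as(keys)
        scores = self.v(torch.tanh(self.w(torch.cat([keys, q], dim=-1))))
        alpha = F.softmax(scores, dim=1)              # weights over the T timesteps
        return (alpha * keys).sum(dim=1)              # (B, D) attended summary

class HierarchicalContextEncoder(nn.Module):
    """Word-level GRU per utterance, then utterance-level GRU over the dialog,
    with question-conditioned attention at both levels (illustrative sketch)."""
    def __init__(self, dim):
        super().__init__()
        self.word_rnn = nn.GRU(dim, dim, batch_first=True)
        self.utt_rnn = nn.GRU(dim, dim, batch_first=True)
        self.word_attn = AdditiveAttention(dim)
        self.utt_attn = AdditiveAttention(dim)

    def forward(self, history, question):             # history: (B, U, T, D), question: (B, D)
        B, U, T, D = history.shape
        words, _ = self.word_rnn(history.reshape(B * U, T, D))   # encode each utterance
        utt_vecs = self.word_attn(question.repeat_interleave(U, 0), words)
        utt_seq, _ = self.utt_rnn(utt_vecs.reshape(B, U, D))     # encode the dialog
        return self.utt_attn(question, utt_seq)       # context-aware summary, (B, D)

The resulting context summary would then be fused with the question and the attended video streams before answer decoding.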

Citations

Multi-Turn Video Question Answering via Hierarchical Attention Context Reinforced Networks
TLDR
This paper proposes the hierarchical attention context network for context-aware question understanding by modeling the hierarchically sequential conversation context structure and develops the reinforced decoder network to generate the open-ended natural language answer for multi-turn video question answering.
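The "reinforced" part of this decoder can be read as policy-gradient training of the answer generator. Below is a minimal sketch of a REINFORCE-style sequence loss with a baseline, assuming answers are sampled from the decoder and scored by some sentence-level reward; the reward and baseline are illustrative placeholders, not the authors' choices.

import torch

def reinforce_loss(log_probs, rewards, baseline):
    # log_probs: (B, T) token log-probabilities of the sampled answers
    # rewards:   (B,)   sentence-level reward, e.g. similarity to the reference answer
    # baseline:  (B,)   learned or running-average baseline to reduce variance
    advantage = (rewards - baseline).detach()   # the reward signal is not differentiated
    return -(advantage.unsqueeze(1) * log_probs).sum(dim=1).mean()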
Multi-Turn Video Question Generation via Reinforced Multi-Choice Attention Network
TLDR
A new framework for VQG is proposed, which introduces an attention mechanism to reason over the dialog history and a selection mechanism to choose among the candidate questions generated at each round of the dialog.
Temporal Pyramid Transformer with Multimodal Interaction for Video Question Answering
TLDR
A novel Temporal Pyramid Transformer model with multimodal interaction for VideoQA is proposed and shown to outperform state-of-the-art methods.
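One plausible reading of the temporal-pyramid idea is pooling frame features at several temporal scales before the cross-modal interaction. A minimal sketch, assuming precomputed per-frame features; the scales and average pooling are illustrative assumptions, not the paper's exact design.

import torch
import torch.nn.functional as F

def temporal_pyramid(frames, scales=(1, 2, 4)):    # frames: (B, T, D)
    levels = []
    for s in scales:
        # adaptive average pooling over time yields `s` segment features per level
        pooled = F.adaptive_avg_pool1d(frames.transpose(1, 2), s)   # (B, D, s)
        levels.append(pooled.transpose(1, 2))       # (B, s, D)
    return torch.cat(levels, dim=1)                 # (B, sum(scales), D)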
Video Dialog via Multi-Grained Convolutional Self-Attention Context Multi-Modal Networks
TLDR
A novel approach for video dialog called the multi-grained convolutional self-attention context network is proposed, which combines video information with dialog history; a hierarchical dialog history encoder is designed to learn the context-aware question representation.
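A minimal sketch of convolutional self-attention over a dialog-history sequence, assuming "multi-grained" means parallel 1D convolutions with different kernel sizes followed by global self-attention; the kernel sizes, head count, and fusion projection are illustrative assumptions.

import torch
import torch.nn as nn

class ConvSelfAttention(nn.Module):
    def __init__(self, dim, kernel_sizes=(1, 3, 5), heads=4):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv1d(dim, dim, k, padding=k // 2) for k in kernel_sizes)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.proj = nn.Linear(dim * len(kernel_sizes), dim)

    def forward(self, x):                            # x: (B, T, D)
        # local n-gram "grains" at several kernel sizes
        grains = [c(x.transpose(1, 2)).transpose(1, 2) for c in self.convs]
        h = self.proj(torch.cat(grains, dim=-1))     # fuse the multi-grained features
        out, _ = self.attn(h, h, h)                  # global self-attention over the history
        return out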
Compositional Attention Networks With Two-Stream Fusion for Video Question Answering
TLDR
The compositional attention module is the core of CAN and can be seen as a modular combination of a unified attention block with different fusion strategies; the model achieves new state-of-the-art results on all the datasets.
Video Dialog via Multi-Grained Convolutional Self-Attention Context Networks
TLDR
A novel approach for video dialog called the multi-grained convolutional self-attention context network is proposed, which combines video information with dialog history and achieves higher time efficiency; extensive experiments also show the effectiveness of the method.
Long-Term Video Question Answering via Multimodal Hierarchical Memory Attentive Networks
TLDR
Experimental results demonstrate that the proposed approach significantly outperforms other state-of-the-art methods for long-term video question answering, and extensive ablation studies are carried out to explore the reasons behind the proposed model's effectiveness.
Graph-Based Multi-Interaction Network for Video Question Answering
TLDR
A graph-based relation-aware neural network is proposed to learn a more fine-grained visual representation that captures the relationships and dependencies between objects spatially and temporally in videos.
Video Dialog via Progressive Inference and Cross-Transformer
TLDR
This paper introduces a novel progressive inference mechanism for video dialog, which progressively updates the query information based on dialog history and video content until the agent considers the information sufficient and unambiguous.
Multi-Question Learning for Visual Question Answering
TLDR
An effective VQA framework is proposed and a training procedure for MQL is designed, where a specifically designed attention network models the relation between the input video and the corresponding questions, enabling multiple video-question pairs to be co-trained.
...

References

Showing 1-10 of 29 references
Video Question Answering via Hierarchical Spatio-Temporal Attention Networks
TLDR
This paper proposes the hierarchical spatio-temporal attention network for learning the joint representation of the dynamic video contents according to the given question, and develops an encoder-decoder learning method with a multi-step reasoning process for open-ended video question answering.
Uncovering Temporal Context for Video Question and Answering
TLDR
An encoder-decoder approach is presented that uses Recurrent Neural Networks to learn temporal structures of videos and introduces a dual-channel ranking loss to answer multiple-choice questions.
TGIF-QA: Toward Spatio-Temporal Reasoning in Visual Question Answering
TLDR
This paper proposes three new tasks designed specifically for video VQA, which require spatio-temporal reasoning from videos to answer questions correctly, and introduces a new large-scale dataset for video VQA named TGIF-QA that extends existing VQA work with these new tasks.
Leveraging Video Descriptions to Learn Video Question Answering
TLDR
A self-paced learning procedure to iteratively identify non-perfect candidate QA pairs and mitigate their effects in training is proposed and shown to be effective, and the extended SS model outperforms various baselines.
Stacked Attention Networks for Image Question Answering
TLDR
A multiple-layer SAN is developed in which an image is queried multiple times to infer the answer progressively; the SAN locates the relevant visual clues that lead to the answer layer-by-layer.
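A minimal sketch of the stacked-attention idea: the question vector is refined over several attention hops on image-region features, with an additive score and a residual query update; the hop count and layer shapes are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class StackedAttention(nn.Module):
    def __init__(self, dim, hops=2):
        super().__init__()
        self.w_v = nn.ModuleList(nn.Linear(dim, dim) for _ in range(hops))
        self.w_q = nn.ModuleList(nn.Linear(dim, dim) for _ in range(hops))
        self.score = nn.ModuleList(nn.Linear(dim, 1) for _ in range(hops))

    def forward(self, regions, query):               # regions: (B, R, D), query: (B, D)
        u = query
        for wv, wq, sc in zip(self.w_v, self.w_q, self.score):
            h = torch.tanh(wv(regions) + wq(u).unsqueeze(1))   # (B, R, D)
            alpha = F.softmax(sc(h), dim=1)           # attention over the R regions
            u = u + (alpha * regions).sum(dim=1)      # residual refinement of the query
        return u                                      # answer-ready representation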
Answer Sequence Learning with Neural Networks for Answer Selection in Community Question Answering
TLDR
This approach first applies convolutional neural networks to learn the joint representation of each question-answer pair, and then feeds the joint representations into a long short-term memory (LSTM) network that learns the answer sequence of a question and labels the matching quality of each answer.
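A minimal sketch of that pipeline: a small CNN pools each question-answer pair into a joint feature, and an LSTM reads the sequence of answer features to produce a quality label per answer. The shapes, pooling, and label set are illustrative assumptions.

import torch
import torch.nn as nn

class QAPairCNN(nn.Module):
    def __init__(self, dim, kernel=3):
        super().__init__()
        self.conv = nn.Conv1d(dim, dim, kernel, padding=kernel // 2)

    def forward(self, pair_tokens):                  # (B, T, D) embedded Q+A tokens
        h = torch.relu(self.conv(pair_tokens.transpose(1, 2)))
        return h.max(dim=2).values                   # (B, D) max-pooled joint feature

class AnswerSequenceLabeler(nn.Module):
    def __init__(self, dim, num_labels=3):           # e.g. Good / Potential / Bad
        super().__init__()
        self.lstm = nn.LSTM(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, num_labels)

    def forward(self, pair_feats):                   # (B, A, D): A answers per question
        h, _ = self.lstm(pair_feats)
        return self.out(h)                           # per-answer label logits, (B, A, num_labels)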
Visual Dialog
TLDR
A retrieval-based evaluation protocol for Visual Dialog is introduced, where the AI agent is asked to sort a set of candidate answers and is evaluated on metrics such as the mean reciprocal rank of the human response, together with a family of neural encoder-decoder models that outperform a number of sophisticated baselines.
Visual Question Answering with Question Representation Update (QRU)
TLDR
The model contains several reasoning layers that exploit complex visual relations in the visual question answering (VQA) task; it is end-to-end trainable through back-propagation, with weights initialized from a pre-trained convolutional neural network (CNN) and a gated recurrent unit (GRU).
MovieQA: Understanding Stories in Movies through Question-Answering
TLDR
The MovieQA dataset, which aims to evaluate automatic story comprehension from both video and text, is introduced, and existing QA techniques are extended to show that question answering with such open-ended semantics is hard.
...