Open-Ended Long-form Video Question Answering via Adaptive Hierarchical Reinforced Networks

@inproceedings{Zhao2018OpenEndedLV,
  title={Open-Ended Long-form Video Question Answering via Adaptive Hierarchical Reinforced Networks},
  author={Zhou Zhao and Zhu Zhang and Shuwen Xiao and Zhou Yu and Jun Yu and Deng Cai and Fei Wu and Yueting Zhuang},
  booktitle={IJCAI},
  year={2018}
}
Open-ended long-form video question answering is a challenging problem in visual information retrieval, which automatically generates a natural language answer from the referenced long-form video content according to the question. However, existing video question answering works mainly focus on short-form video question answering, due to the lack of modeling of the semantic representation of long-form video contents. In this paper, we consider the problem of long-form video question…
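The abstract is truncated before the method details, but the title and the citing works below point to an encoder-decoder design with a hierarchical video encoder. As a rough illustration of that family, here is a minimal PyTorch sketch of a two-level encoder with question-guided attention. Every name, dimension, and the fixed-length segmenting scheme are assumptions for illustration, not the paper's actual architecture.

# Hypothetical sketch of a hierarchical video encoder for long-form video QA.
# None of these names or hyperparameters come from the paper itself.
import torch
import torch.nn as nn
import torch.nn.functional as F

class HierarchicalVideoEncoder(nn.Module):
    def __init__(self, feat_dim=2048, hidden=512, seg_len=16):
        super().__init__()
        self.seg_len = seg_len                       # frames per segment (assumed)
        self.frame_rnn = nn.GRU(feat_dim, hidden, batch_first=True)
        self.seg_rnn = nn.GRU(hidden, hidden, batch_first=True)
        self.attn = nn.Linear(2 * hidden, 1)         # question-guided attention score

    def forward(self, frames, question):
        # frames:   (B, T, feat_dim) pre-extracted frame features, T >= seg_len
        # question: (B, hidden)      encoded question vector
        B, T, D = frames.shape
        frames = frames[:, : T - T % self.seg_len]   # drop the ragged tail
        segs = frames.reshape(-1, self.seg_len, D)   # (B * num_segs, seg_len, D)
        _, h = self.frame_rnn(segs)                  # frame level: encode each segment
        seg_feats = h[-1].reshape(B, -1, h.size(-1))
        seg_out, _ = self.seg_rnn(seg_feats)         # segment level: whole video
        q = question.unsqueeze(1).expand_as(seg_out)
        w = F.softmax(self.attn(torch.cat([seg_out, q], dim=-1)), dim=1)
        return (w * seg_out).sum(dim=1)              # question-aware video summary

A reinforced decoder, such as an RNN trained with a policy-gradient reward on the generated answer, would consume this summary; that half of the pipeline is omitted here.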

Citations

Long-Form Video Question Answering via Dynamic Hierarchical Reinforced Networks
TLDR
A dynamic hierarchical reinforced network for open-ended long-form video question answering is introduced, which employs an encoder–decoder architecture with a dynamic hierarchical encoder and a reinforced decoder to generate natural language answers.
Open-Ended Long-Form Video Question Answering via Hierarchical Convolutional Self-Attention Networks
TLDR
A hierarchical convolutional self-attention encoder to efficiently model long-form video contents, which builds the hierarchical structure for video sequences and captures question-aware long-range dependencies from video context and a multi-scale attentive decoder to incorporate multi-layer video representations for answer generation.
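The TLDR above pairs convolution with self-attention inside a hierarchical structure. Below is a toy single block in that spirit; the kernel size, pooling factor, and head count are assumptions, and the question conditioning mentioned above is omitted for brevity.

# Toy block: local convolution, temporal downsampling, then self-attention.
# A loose illustration only; the real model's layout is not specified here.
import torch.nn as nn

class ConvSelfAttnBlock(nn.Module):
    def __init__(self, dim=512, heads=8, pool=2):
        super().__init__()
        self.conv = nn.Conv1d(dim, dim, kernel_size=3, padding=1)
        self.pool = nn.MaxPool1d(pool)               # coarsen time: one hierarchy step
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        # x: (B, T, dim) video features; returns (B, T // pool, dim)
        y = self.conv(x.transpose(1, 2))             # local temporal context
        y = self.pool(y).transpose(1, 2)             # shorter sequence at this level
        out, _ = self.attn(y, y, y)                  # long-range dependencies
        return out

Stacking several such blocks yields progressively coarser, longer-range representations, and a multi-scale decoder could then attend over the outputs of every level.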
Mounting Video Metadata on Transformer-based Language Model for Open-ended Video Question Answering
TLDR
An ablation study is performed by changing the existing DramaQA dataset to an open-ended question answering setting, and it shows that performance can be improved using video metadata.
Multi-Turn Video Question Answering via Hierarchical Attention Context Reinforced Networks
TLDR
This paper proposes the hierarchical attention context network for context-aware question understanding by modeling the hierarchically sequential conversation context structure and develops the reinforced decoder network to generate the open-ended natural language answer for multi-turn video question answering.
Learning Question-Guided Video Representation for Multi-Turn Video Question Answering
TLDR
This work proposes a question-guided video representation module that efficiently generates a token-level video summary guided by each word in the question, which is then fused with the question to generate the answer.
Learning to Answer Visual Questions from Web Videos
TLDR
This work proposes to avoid manual annotation and to generate a large-scale training dataset for video question answering by making use of automatic cross-modal supervision: a question generation transformer trained on text data generates question-answer pairs from transcribed video narrations.
Spatiotemporal-Textual Co-Attention Network for Video Question Answering
TLDR
A novel Spatiotemporal-Textual Co-Attention Network (STCA-Net) for video question answering jointly learns spatial and temporal visual attention on videos as well as textual attention on questions.
Video Question Answering: a Survey of Models and Datasets
TLDR
A general framework for VideoQA is proposed, covering the core processing model, recurrent neural network (RNN) encoders, and feature fusion, and the ideas and applications of methods such as encoder-decoder architectures, attention models, and memory networks are described in detail.
Video Question Answering via Knowledge-based Progressive Spatial-Temporal Attention Network
TLDR
A knowledge-based progressive spatial-temporal attention network is proposed to tackle the problem of video question answering by taking the spatial and temporal dimension of video content into account and employing an external knowledge base to improve the answering ability of the network.
End-to-End Video Question-Answer Generation With Generator-Pretester Network
TLDR
A novel Generator-Pretester Network focuses on two components: the Joint Question-Answer Generator (JQAG), which generates a question with its corresponding answer to enable video question “answering” training, and the Pretester (PT), which verifies a generated question by trying to answer it and checks the pretested answer against both the model’s proposed answer and the ground-truth answer.
...

References

SHOWING 1-10 OF 31 REFERENCES
Video Question Answering via Hierarchical Spatio-Temporal Attention Networks
TLDR
This paper proposes the hierarchical spatio-temporal attention network for learning the joint representation of the dynamic video contents according to the given question, and develops a spatio-temporal encoder-decoder learning method with a multi-step reasoning process for open-ended video question answering.
Leveraging Video Descriptions to Learn Video Question Answering
TLDR
A self-paced learning procedure to iteratively identify non-perfect candidate QA pairs and mitigate their effects in training is proposed and shown to be effective, and the extended SS model outperforms various baselines.
Uncovering Temporal Context for Video Question and Answering
TLDR
An encoder-decoder approach that uses Recurrent Neural Networks to learn the temporal structure of videos and introduces a dual-channel ranking loss to answer multiple-choice questions is presented.
Visual Question Answering with Question Representation Update (QRU)
TLDR
This model contains several reasoning layers that exploit complex visual relations in the visual question answering (VQA) task; it is end-to-end trainable through back-propagation, with weights initialized from a pre-trained convolutional neural network (CNN) and a gated recurrent unit (GRU).
MovieQA: Understanding Stories in Movies through Question-Answering
TLDR
The MovieQA dataset, which aims to evaluate automatic story comprehension from both video and text, is introduced and existing QA techniques are extended to show that question-answering with such open-ended semantics is hard.
VQA: Visual Question Answering
We propose the task of free-form and open-ended Visual Question Answering (VQA). Given an image and a natural language question about the image, the task is to provide an accurate natural language…
TGIF-QA: Toward Spatio-Temporal Reasoning in Visual Question Answering
TLDR
This paper proposes three new tasks designed specifically for video VQA, which require spatio-temporal reasoning from videos to answer questions correctly, and introduces a new large-scale dataset for video VQA named TGIF-QA that extends existing VQA work with its new tasks.
Stacked Attention Networks for Image Question Answering
TLDR
A multiple-layer SAN is developed in which an image is queried multiple times to infer the answer progressively, and the SAN is shown to locate the relevant visual clues that lead to the answer of the question layer-by-layer.
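Since the TLDR above states the mechanism concretely (query the image repeatedly, refining the question representation layer-by-layer), here is a minimal single attention layer in that style. The dimensions and layer names are assumptions; only the refine-and-re-query recurrence follows the summary.

# One stacked-attention step: attend over regions, then refine the query.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SANLayer(nn.Module):
    def __init__(self, dim=512, attn_dim=256):
        super().__init__()
        self.w_img = nn.Linear(dim, attn_dim, bias=False)
        self.w_query = nn.Linear(dim, attn_dim)
        self.w_score = nn.Linear(attn_dim, 1)

    def forward(self, regions, query):
        # regions: (B, R, dim) image region features; query: (B, dim)
        h = torch.tanh(self.w_img(regions) + self.w_query(query).unsqueeze(1))
        p = F.softmax(self.w_score(h), dim=1)        # attention over the R regions
        attended = (p * regions).sum(dim=1)          # weighted region summary
        return attended + query                      # refined query for the next layer

Querying the image twice then reads: u1 = layer1(regions, q); u2 = layer2(regions, u1); an answer classifier consumes u2.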
Visual question answering: A survey of methods and datasets
Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning
TLDR
This paper proposes a new approach, namely the Hierarchical Recurrent Neural Encoder (HRNE), which exploits the temporal structure of videos over a longer range by reducing the length of the input information flow and compositing multiple consecutive inputs at a higher level.
...