Open-Ended Long-form Video Question Answering via Adaptive Hierarchical Reinforced Networks
@inproceedings{Zhao2018OpenEndedLV,
  title     = {Open-Ended Long-form Video Question Answering via Adaptive Hierarchical Reinforced Networks},
  author    = {Zhou Zhao and Zhu Zhang and Shuwen Xiao and Zhou Yu and Jun Yu and Deng Cai and Fei Wu and Yueting Zhuang},
  booktitle = {IJCAI},
  year      = {2018}
}
Open-ended long-form video question answering is a challenging problem in visual information retrieval, which requires automatically generating a natural language answer from the referenced long-form video content according to the question. However, existing video question answering works mainly focus on short-form videos, due to the lack of modeling of the semantic representation of long-form video contents. In this paper, we consider the problem of long-form video question…
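For illustration only, the minimal PyTorch sketch below shows one way an encoder-decoder of this general shape (a hierarchical video encoder plus a question-guided decoder that generates an open-ended answer) could be assembled. It is not the paper's implementation: the chunking scheme, module names, dimensions, and the greedy decoding loop are all assumptions made for this example.

```python
# Illustrative sketch only (not the authors' architecture): a two-level GRU
# video encoder, question-conditioned attention over segments, and a greedy
# GRU decoder for open-ended answer generation. All sizes/names are assumed.
import torch
import torch.nn as nn


class HierarchicalVideoQA(nn.Module):
    def __init__(self, vocab_size, frame_dim=2048, hidden=512, chunk=16):
        super().__init__()
        self.chunk = chunk
        self.frame_rnn = nn.GRU(frame_dim, hidden, batch_first=True)   # frame level
        self.segment_rnn = nn.GRU(hidden, hidden, batch_first=True)    # segment level
        self.q_embed = nn.Embedding(vocab_size, hidden)
        self.q_rnn = nn.GRU(hidden, hidden, batch_first=True)
        self.attn = nn.Linear(2 * hidden, 1)
        self.dec_rnn = nn.GRUCell(hidden, hidden)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, frames, question, max_len=10):
        # frames: (B, T, frame_dim); question: (B, Lq) token ids
        B, T, D = frames.shape
        # Split the long frame sequence into fixed-size chunks, encode each chunk.
        chunks = frames.view(B, T // self.chunk, self.chunk, D)
        seg_states = []
        for i in range(chunks.size(1)):
            _, h = self.frame_rnn(chunks[:, i])           # h: (1, B, hidden)
            seg_states.append(h.squeeze(0))
        segs = torch.stack(seg_states, dim=1)             # (B, S, hidden)
        seg_out, _ = self.segment_rnn(segs)               # segment-level context

        _, q_h = self.q_rnn(self.q_embed(question))
        q = q_h.squeeze(0)                                 # (B, hidden)

        # Question-guided attention over segment representations.
        q_exp = q.unsqueeze(1).expand_as(seg_out)
        scores = self.attn(torch.cat([seg_out, q_exp], dim=-1)).squeeze(-1)
        alpha = torch.softmax(scores, dim=1)
        video_ctx = (alpha.unsqueeze(-1) * seg_out).sum(dim=1)

        # Greedy decoder producing an open-ended answer token sequence.
        h, inp, tokens = q, video_ctx, []
        for _ in range(max_len):
            h = self.dec_rnn(inp, h)
            tok = self.out(h).argmax(dim=-1)
            tokens.append(tok)
            inp = video_ctx + self.q_embed(tok)            # feed context + last token
        return torch.stack(tokens, dim=1)                  # (B, max_len)


if __name__ == "__main__":
    model = HierarchicalVideoQA(vocab_size=1000)
    frames = torch.randn(2, 64, 2048)                      # 64 frames per video
    question = torch.randint(0, 1000, (2, 8))              # 8-token questions
    print(model(frames, question).shape)                   # torch.Size([2, 10])
```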
37 Citations
Long-Form Video Question Answering via Dynamic Hierarchical Reinforced Networks
- Computer Science · IEEE Transactions on Image Processing
- 2019
A dynamic hierarchical reinforced network for open-ended long-form video question answering is introduced, which employs an encoder–decoder architecture with a dynamic hierarchical encoder and a reinforced decoder to generate natural language answers.
Open-Ended Long-Form Video Question Answering via Hierarchical Convolutional Self-Attention Networks
- Computer Science · IJCAI
- 2019
A hierarchical convolutional self-attention encoder is proposed to efficiently model long-form video contents, building a hierarchical structure for video sequences and capturing question-aware long-range dependencies from the video context, together with a multi-scale attentive decoder that incorporates multi-layer video representations for answer generation.
Mounting Video Metadata on Transformer-based Language Model for Open-ended Video Question Answering
- Computer Science, Education · ArXiv
- 2021
An ablation study is performed by converting the existing DramaQA dataset to an open-ended question answering setting, and it shows that performance can be improved by using video metadata.
Multi-Turn Video Question Answering via Hierarchical Attention Context Reinforced Networks
- Computer Science · IEEE Transactions on Image Processing
- 2019
This paper proposes the hierarchical attention context network for context-aware question understanding by modeling the hierarchically sequential conversation context structure and develops the reinforced decoder network to generate the open-ended natural language answer for multi-turn video question answering.
Learning Question-Guided Video Representation for Multi-Turn Video Question Answering
- Computer Science · ViGIL@NeurIPS
- 2019
This work proposes a question-guided video representation module that efficiently generates a token-level video summary guided by each word in the question, which is then fused with the question to generate the answer.
Learning to Answer Visual Questions from Web Videos
- Computer Science · IEEE Transactions on Pattern Analysis and Machine Intelligence
- 2022
This work proposes to avoid manual annotation and generate a large-scale training dataset for video question answering by making use of automatic cross-modal supervision, using a question generation transformer trained on text data to generate question-answer pairs from transcribed video narrations.
Spatiotemporal-Textual Co-Attention Network for Video Question Answering
- Computer Science · ACM Trans. Multim. Comput. Commun. Appl.
- 2019
A novel Spatiotemporal-Textual Co-Attention Network (STCA-Net) for video question answering jointly learns spatial and temporal visual attention on videos as well as textual attention on questions.
Video Question Answering: a Survey of Models and Datasets
- Computer Science · Mob. Networks Appl.
- 2021
A general framework of VideoQA is proposed, covering the core processing model, recurrent neural network (RNN) encoders, and feature fusion, and the ideas and applications of methods such as encoder-decoder, attention, and memory networks are discussed in detail.
Video Question Answering via Knowledge-based Progressive Spatial-Temporal Attention Network
- Computer Science · ACM Trans. Multim. Comput. Commun. Appl.
- 2019
A knowledge-based progressive spatial-temporal attention network is proposed to tackle the problem of video question answering by taking the spatial and temporal dimension of video content into account and employing an external knowledge base to improve the answering ability of the network.
End-to-End Video Question-Answer Generation With Generator-Pretester Network
- Computer Science · IEEE Transactions on Circuits and Systems for Video Technology
- 2021
A novel Generator-Pretester Network focuses on two components: the Joint Question-Answer Generator (JQAG), which generates a question together with its corresponding answer to enable video question “answering” training, and the Pretester (PT), which verifies a generated question by trying to answer it and checks the pretested answer against both the model’s proposed answer and the ground-truth answer.
References
Showing 1-10 of 31 references
Video Question Answering via Hierarchical Spatio-Temporal Attention Networks
- Computer Science · IJCAI
- 2017
This paper proposes a hierarchical spatio-temporal attention network for learning the joint representation of dynamic video contents according to the given question and develops an encoder-decoder learning method with a multi-step reasoning process for open-ended video question answering.
Leveraging Video Descriptions to Learn Video Question Answering
- Computer Science, Physics · AAAI
- 2017
A self-paced learning procedure to iteratively identify imperfect candidate QA pairs and mitigate their effects in training is proposed and shown to be effective, and the extended SS model outperforms various baselines.
Uncovering Temporal Context for Video Question and Answering
- Computer Science · ArXiv
- 2015
An encoder-decoder approach that uses recurrent neural networks to learn the temporal structure of videos is presented, together with a dual-channel ranking loss for answering multiple-choice questions.
Visual Question Answering with Question Representation Update (QRU)
- Computer Science · NIPS
- 2016
The model contains several reasoning layers that exploit complex visual relations in the visual question answering (VQA) task; it is end-to-end trainable through back-propagation, with its weights initialized from a pre-trained convolutional neural network (CNN) and gated recurrent unit (GRU).
MovieQA: Understanding Stories in Movies through Question-Answering
- Computer Science · 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
- 2016
The MovieQA dataset, which aims to evaluate automatic story comprehension from both video and text, is introduced, and existing QA techniques are extended to show that question-answering with such open-ended semantics is hard.
VQA: Visual Question Answering
- Computer Science · 2015 IEEE International Conference on Computer Vision (ICCV)
- 2015
We propose the task of free-form and open-ended Visual Question Answering (VQA). Given an image and a natural language question about the image, the task is to provide an accurate natural language…
TGIF-QA: Toward Spatio-Temporal Reasoning in Visual Question Answering
- Computer Science · 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
- 2017
This paper proposes three new tasks designed specifically for video VQA, which require spatio-temporal reasoning from videos to answer questions correctly, and introduces a new large-scale dataset for video VQA named TGIF-QA that extends existing VQA work with its new tasks.
Stacked Attention Networks for Image Question Answering
- Computer Science · 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
- 2016
A multiple-layer SAN is developed in which an image is queried multiple times to infer the answer progressively, with the SAN locating the relevant visual clues that lead to the answer of the question layer by layer.
Visual question answering: A survey of methods and datasets
- Computer Science · Comput. Vis. Image Underst.
- 2017
Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning
- Computer Science · 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
- 2016
This paper proposes a new approach, the Hierarchical Recurrent Neural Encoder (HRNE), which exploits video temporal structure over a longer range by reducing the length of the input information flow and compositing multiple consecutive inputs at a higher level.