Corpus ID: 233204465

Fill-in-the-blank as a Challenging Video Understanding Evaluation Framework

Santiago Castro, Ruoyao Wang, Ping-Chia Huang, Ian Stewart, Nan Liu, Jonathan C. Stroud, Rada Mihalcea
Work to date on language-informed video understanding has primarily addressed two tasks: (1) video question answering using multiple-choice questions, where models perform relatively well because they exploit the fact that candidate answers are readily available; and (2) video captioning, which relies on an open-ended evaluation framework that is often inaccurate because system answers may be perceived as incorrect if they differ in form from the ground truth. In this paper, we propose fill-in… 
1 Citation
WhyAct: Identifying Action Reasons in Lifestyle Vlogs
A multimodal model is described that leverages visual and textual information to automatically infer the reasons corresponding to an action presented in the video.


VideoMCC: a New Benchmark for Video Comprehension
Video Multiple Choice Caption (VideoMCC) is formulated as a new well-defined task with an easy-to-interpret performance measure, and a varied collection of approaches is proposed and tested on this benchmark to gain a better understanding of the new challenges posed by video comprehension.
A Dataset and Exploration of Models for Understanding Video Data through Fill-in-the-Blank Question-Answering
This task is not solvable by a language model alone; the model combining 2D and 3D visual information provides the best result, yet all models perform significantly worse than human level.
Unifying the Video and Question Attentions for Open-Ended Video Question Answering
This paper proposes a dataset for open-ended video QA built with automatic question generation approaches, and introduces sequential video attention and temporal question attention models, which are integrated into a unified attention model.
Uncovering the Temporal Context for Video Question Answering
An encoder–decoder approach using Recurrent Neural Networks to learn the temporal structures of videos and introduce a dual-channel ranking loss to answer multiple-choice questions is presented.
TVQA: Localized, Compositional Video Question Answering
This paper presents TVQA, a large-scale video QA dataset based on 6 popular TV shows, and provides analyses of this new dataset as well as several baselines and a multi-stream end-to-end trainable neural network framework for the TVQA task.
TGIF-QA: Toward Spatio-Temporal Reasoning in Visual Question Answering
This paper proposes three new tasks designed specifically for video VQA, which require spatio-temporal reasoning from videos to answer questions correctly, and introduces a new large-scale dataset for video VQA named TGIF-QA that extends existing VQA work with these new tasks.
VQA: Visual Question Answering
We propose the task of free-form and open-ended Visual Question Answering (VQA). Given an image and a natural language question about the image, the task is to provide an accurate natural language answer.
MovieQA: Understanding Stories in Movies through Question-Answering
The MovieQA dataset, which aims to evaluate automatic story comprehension from both video and text, is introduced and existing QA techniques are extended to show that question-answering with such open-ended semantics is hard.
ActivityNet-QA: A Dataset for Understanding Complex Web Videos via Question Answering
This work introduces ActivityNet-QA, a fully annotated and large-scale video QA dataset, which consists of 58,000 QA pairs on 5,800 complex web videos derived from the popular ActivityNet dataset, and explores various video representation strategies to improve video QA performance.
DeepStory: Video Story QA by Deep Embedded Memory Networks
A video-story learning model, Deep Embedded Memory Networks (DEMN), reconstructs stories from a joint scene-dialogue video stream using a latent embedding space of observed data, and outperforms other QA models.