Corpus ID: 220381370

What Gives the Answer Away? Question Answering Bias Analysis on Video QA Datasets

Jianing Yang, Yuying Zhu, Yongxin Wang, Ruitao Yi, AmirAli Bagher Zadeh, Louis-Philippe Morency
Question answering biases in video QA datasets can mislead multimodal models into overfitting to QA artifacts and jeopardize the models' ability to generalize. Understanding how strong these QA biases are and where they come from helps the community measure progress more accurately and provides researchers with insights to debug their models. In this paper, we analyze QA biases in popular video question answering datasets and discover that pretrained language models can answer 37-48% of questions correctly without…
AGQA: A Benchmark for Compositional Spatio-Temporal Reasoning
Using AGQA, modern visual reasoning systems are evaluated, demonstrating that the best models barely perform better than non-visual baselines exploiting linguistic biases and that none of the existing models generalize to novel compositions unseen during training.
TrUMAn: Trope Understanding in Movies and Animations
A Trope Understanding and Storytelling (TrUSt) model with a new Conceptual Storyteller module is presented; the module guides the video encoder by performing video storytelling on a latent space and boosts model performance to 13.94%.
Video Question Answering with Phrases via Semantic Roles
Video Question Answering (VidQA) evaluation metrics have been limited to a single-word answer or selecting a phrase from a fixed set of phrases. These metrics limit the VidQA models’ application…


RUBi: Reducing Unimodal Biases in Visual Question Answering
RUBi, a new learning strategy to reduce biases in any VQA model, is proposed, which reduces the importance of the most biased examples, i.e. examples that can be correctly classified without looking at the image.
Are we Asking the Right Questions in MovieQA?
The biases in the MovieQA dataset are explored, a strikingly simple model that can exploit them is proposed, and it is found that using the right word embedding is of utmost importance.
MovieQA: Understanding Stories in Movies through Question-Answering
The MovieQA dataset, which aims to evaluate automatic story comprehension from both video and text, is introduced and existing QA techniques are extended to show that question-answering with such open-ended semantics is hard.
Explicit Bias Discovery in Visual Question Answering Models
This work stores the words of the question, answer and visual words corresponding to regions of interest in attention maps in a database, and runs simple rule mining algorithms on this database to discover human-interpretable rules which give unique insight into the behavior of VQA models.
Revisiting Visual Question Answering Baselines
The results suggest that a key problem of current VQA systems lies in the lack of visual grounding and localization of concepts that occur in the questions and answers, and a simple alternative model based on binary classification is developed.
Did the Model Understand the Question?
Analysis of state-of-the-art deep learning models for question answering on images, tables, and passages of text finds that these deep networks often ignore important question terms, and demonstrates that attributions can augment standard measures of accuracy and empower investigation of model performance.
TVQA: Localized, Compositional Video Question Answering
This paper presents TVQA, a large-scale video QA dataset based on 6 popular TV shows, and provides analyses of this new dataset as well as several baselines and a multi-stream end-to-end trainable neural network framework for the TVQA task.
Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering
This work balances the popular VQA dataset by collecting complementary images such that every question in this balanced dataset is associated with not just a single image, but rather a pair of similar images that result in two different answers to the question.
VQA: Visual Question Answering
We propose the task of free-form and open-ended Visual Question Answering (VQA). Given an image and a natural language question about the image, the task is to provide an accurate natural language…
BERT Representations for Video Question Answering
This work proposes to use BERT, a sequential modelling technique based on Transformers, to encode the complex semantics from video clips to capture the visual and language information of a video scene by encoding not only the subtitles but also a sequence of visual concepts with a pretrained language-based Transformer.