VQA With No Questions-Answers Training

  • B. Vatashsky, S. Ullman
  • Published 2020
  • Computer Science
  • 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Methods for teaching machines to answer visual questions have made significant progress in the last few years, but although they demonstrate impressive results on particular datasets, these methods lack some important human capabilities, including integrating new visual classes and concepts in a modular manner, providing explanations for the answer, and handling new domains without new examples. [...] Key Method: The system includes a question representation stage followed by an answering procedure, which invokes an…
Domain-robust VQA with diverse datasets and methods but no target labels
This work quantifies domain shifts between popular VQA datasets, and focuses on unsupervised domain adaptation and the open-ended classification task formulation to emulate the setting of real-world generalization.
Interpretable visual reasoning: A survey
A taxonomy based on four explanation forms (vision, text, graph, and symbol) used in current visual reasoning is established, the challenges for IVR are summarized, and potential research directions are pointed out.
AGQA: A Benchmark for Compositional Spatio-Temporal Reasoning
Using AGQA, modern visual reasoning systems are evaluated, demonstrating that the best models barely perform better than non-visual baselines that exploit linguistic biases, and that none of the existing models generalize to novel compositions unseen during training.
Image interpretation by iterative bottom-up top-down processing
A model in which meaningful scene structures are extracted from the image by an iterative process, combining bottom-up (BU) and top-down (TD) networks, interacting through a symmetric bi-directional communication between them (‘counter-streams’ structure).
Just Ask: Learning to Answer Questions from Millions of Narrated Videos
This work proposes to automatically generate question-answer pairs from transcribed video narrations by leveraging a state-of-the-art text transformer pipeline, obtaining a new large-scale VideoQA training dataset with reduced language biases and high-quality annotations.


Visual question answering: A survey of methods and datasets
The state of the art is examined by comparing modern approaches to VQA, including the common approach of combining convolutional and recurrent neural networks to map images and questions to a common feature space.
FVQA: Fact-Based Visual Question Answering
A conventional visual question answering dataset, which contains image-question-answer triplets, is extended with additional image-question-answer-supporting-fact tuples, and a novel model is described which is capable of reasoning about an image on the basis of supporting facts.
Ask Me Anything: Free-Form Visual Question Answering Based on Knowledge from External Sources
A method for visual question answering which combines an internal representation of the content of an image with information extracted from a general knowledge base to answer a broad range of image-based questions, and which is specifically able to answer questions posed in natural language that refer to information not contained in the image.
VQA-E: Explaining, Elaborating, and Enhancing Your Answers for Visual Questions
This work proposes a new task of VQA-E (VQA with Explanation), where the computational models are required to generate an explanation with the predicted answer, and quantitatively shows that the additional supervision from explanations can not only produce insightful textual sentences to justify the answers, but also improve the performance of answer prediction.
Survey of Recent Advances in Visual Question Answering
The paper describes the approaches taken by various algorithms to extract image features and text features, the way these are employed to predict answers, and briefly discusses the experiments performed to evaluate the VQA models.
Visual Question Answering as Reading Comprehension
This paper proposes to unify all the input information in natural language so as to convert VQA into a machine reading comprehension problem, a step towards being able to exploit large volumes of text and natural language processing techniques to address the VQA problem.
Quantifying and Alleviating the Language Prior Problem in Visual Question Answering
Experimental results show that the score regularization module can not only effectively reduce the language prior problem of these VQA models but also consistently improve their question answering accuracy.
Cycle-Consistency for Robust Visual Question Answering
A model-agnostic framework is proposed that trains a model to not only answer a question, but also generate a question conditioned on the answer, such that the answer predicted for the generated question is the same as the ground truth answer to the original question.
Incorporating External Knowledge to Answer Open-Domain Visual Questions with Dynamic Memory Networks
A novel framework is proposed which endows the model with the capability of answering more complex questions by leveraging massive external knowledge through dynamic memory networks, and which can also answer open-domain questions effectively by drawing on that external knowledge.
The VQA-Machine: Learning How to Use Existing Vision Algorithms to Answer New Questions
The core of the proposed method is a new co-attention model that learns how to exploit a set of external off-the-shelf algorithms to achieve its goal, an approach that has something in common with the Neural Turing Machine.