Generating Question Relevant Captions to Aid Visual Question Answering

@inproceedings{Wu2019GeneratingQR,
  title={Generating Question Relevant Captions to Aid Visual Question Answering},
  author={Jialin Wu and Zeyuan Hu and Raymond J. Mooney},
  booktitle={ACL},
  year={2019}
}
Visual question answering (VQA) and image captioning require a shared body of general knowledge connecting language and vision. We present a novel approach to better VQA performance that exploits this connection by jointly generating captions that are targeted to help answer a specific visual question. The model is trained using an existing caption dataset by automatically determining question-relevant captions using an online gradient-based method. Experimental results on the VQA v2 challenge… 
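
As a rough sketch of the gradient-based caption selection idea described in the abstract, the toy Python snippet below ranks pre-encoded candidate captions by the cosine similarity between their embeddings and the gradient of the VQA loss with respect to the visual features. The linear answer head, the feature dimensions, and the cosine-similarity scoring rule are illustrative assumptions, not the authors' exact formulation.

import torch
import torch.nn.functional as F

torch.manual_seed(0)
d, num_answers, num_captions = 16, 10, 5

# Stand-in components (assumptions for illustration only).
vqa_head = torch.nn.Linear(2 * d, num_answers)      # toy VQA answer classifier
image_feat = torch.randn(1, d, requires_grad=True)  # pooled visual features
question_feat = torch.randn(1, d)                   # encoded question
answer_label = torch.tensor([3])                    # ground-truth answer index
caption_embs = torch.randn(num_captions, d)         # pre-encoded candidate captions

# The gradient of the VQA loss w.r.t. the visual features serves as a relevance signal.
logits = vqa_head(torch.cat([image_feat, question_feat], dim=-1))
loss = F.cross_entropy(logits, answer_label)
(grad,) = torch.autograd.grad(loss, image_feat)     # shape (1, d)

# Rank candidate captions by how well their embeddings align with that gradient.
scores = F.cosine_similarity(caption_embs, grad.expand_as(caption_embs), dim=-1)
print("caption ranking (most question-relevant first):",
      scores.argsort(descending=True).tolist())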

Citations

Latent Variable Models for Visual Question Answering

  • Zixu Wang, Yishu Miao, Lucia Specia
  • Computer Science
    2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW)
  • 2021
TLDR
This work proposes latent variable models for VQA where extra information is incorporated as latent variables to improve inference, which in turn benefits question-answering performance.

Self-Critical Reasoning for Robust Visual Question Answering

TLDR
This work introduces a self-critical training objective that ensures that visual explanations of correct answers match the most influential image regions more than other competitive answer candidates.

All You May Need for VQA are Image Captions

TLDR
This paper proposes a method that automatically derives VQA examples at volume by leveraging the abundance of existing image-caption annotations combined with neural models for textual question generation, and shows that the resulting data is of high quality.

Learning to Ask Informative Sub-Questions for Visual Question Answering

TLDR
This work proposes a novel VQA model that generates questions to actively obtain auxiliary perceptual information useful for correct reasoning, and shows that by inputting the generated questions and their answers as additional information to the VQA model, it can indeed predict the answer more correctly than the baseline model.

CapWAP: Captioning with a Purpose

TLDR
It is demonstrated that under a variety of scenarios the purposeful captioning system learns to anticipate and fulfill specific information needs better than its generic counterparts, as measured by QA performance on user questions from unseen images, when using the caption alone as context.

Visual Question Answering Using Semantic Information from Image Descriptions

TLDR
A deep neural network model is proposed that uses an attention mechanism which utilizes image features, the natural language question asked and semantic knowledge extracted from the image to produce open-ended answers for the given questions.

LPF: A Language-Prior Feedback Objective Function for De-biased Visual Question Answering

TLDR
A novel Language-Prior Feedback (LPF) objective function is proposed to re-balance the proportion of each answer's loss value in the total VQA loss, achieving competitive performance on the bias-sensitive VQA-CP v2 benchmark.

Discovering the Unknown Knowns: Turning Implicit Knowledge in the Dataset into Explicit Training Examples for Visual Question Answering

TLDR
It is found that many of the “unknowns” to the learned VQA model are indeed “known” in the dataset implicitly, and a simple data augmentation pipeline, SimpleAug, is presented to turn this “known” knowledge into training examples for VQA.

Deep Cross of Intra and Inter Modalities for Visual Question Answering

  • Rishav Bhardwaj
  • Computer Science
    Proceedings of the 3rd International Conference on Integrated Intelligent Computing Communication & Security (ICIIC 2021)
  • 2021
TLDR
The main idea behind this architecture is to take the positioning of each feature into account, recognize the relationships between multi-modal features, and establish relationships within each modality in order to learn them more effectively.

References

SHOWING 1-10 OF 36 REFERENCES

Self-Critical Reasoning for Robust Visual Question Answering

TLDR
This work introduces a self-critical training objective that ensures that visual explanations of correct answers match the most influential image regions more than other competitive answer candidates.

VQA-E: Explaining, Elaborating, and Enhancing Your Answers for Visual Questions

TLDR
This work proposes a new task of VQA-E (VQA with Explanation), where the computational models are required to generate an explanation with the predicted answer, and quantitatively shows that the additional supervision from explanations can not only produce insightful textual sentences to justify the answers, but also improve the performance of answer prediction.

Exploring Models and Data for Image Question Answering

TLDR
This work proposes to use neural networks and visual semantic embeddings, without intermediate stages such as object detection and image segmentation, to predict answers to simple questions about images, and presents a question generation algorithm that converts image descriptions into QA form.

VQA: Visual Question Answering

We propose the task of free-form and open-ended Visual Question Answering (VQA). Given an image and a natural language question about the image, the task is to provide an accurate natural language answer…

R-VQA: Learning Visual Relation Facts with Semantic Attention for Visual Question Answering

TLDR
A novel framework to learn visual relation facts for VQA is proposed, along with a multi-step attention model that applies visual attention and semantic attention sequentially to extract related visual knowledge and semantic knowledge.

Tell-and-Answer: Towards Explainable Visual Question Answering using Attributes and Captions

TLDR
This work proposes to break up the end-to-end VQA into two steps, explaining and reasoning, in an attempt towards a more explainable VQA by shedding light on the intermediate results between these two steps.

Out of the Box: Reasoning with Graph Convolution Nets for Factual Visual Question Answering

TLDR
This work develops an entity graph and uses a graph convolutional network to "reason" about the correct answer by jointly considering all entities, and shows that this leads to an improvement in accuracy of around 7% compared to the state of the art.

Faithful Multimodal Explanation for Visual Question Answering

TLDR
This paper presents a novel approach to developing a high-performing VQA system that can elucidate its answers with integrated textual and visual explanations that faithfully reflect important aspects of its underlying reasoning while capturing the style of comprehensible human explanations.

Discriminability Objective for Training Descriptive Captions

TLDR
By incorporating into the captioning training objective a loss component directly related to the ability (of a machine) to disambiguate image/caption matches, this work obtains systems that produce much more discriminative captions, according to human evaluation.

Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding

TLDR
This work extensively evaluates Multimodal Compact Bilinear pooling (MCB) on the visual question answering and grounding tasks and consistently shows the benefit of MCB over ablations without MCB.