Discovering the Unknown Knowns: Turning Implicit Knowledge in the Dataset into Explicit Training Examples for Visual Question Answering

@article{Kil2021DiscoveringTU,
  title={Discovering the Unknown Knowns: Turning Implicit Knowledge in the Dataset into Explicit Training Examples for Visual Question Answering},
  author={Jihyung Kil and Cheng Zhang and Dong Xuan and Wei-Lun Chao},
  journal={ArXiv},
  year={2021},
  volume={abs/2109.06122}
}
Visual question answering (VQA) is challenging not only because the model has to handle multi-modal information, but also because it is just so hard to collect sufficient training examples — there are too many questions one can ask about an image. As a result, a VQA model trained solely on human-annotated examples could easily over-fit specific question styles or image contents that are being asked, leaving the model largely ignorant about the sheer diversity of questions. Existing methods… 


Rethinking Data Augmentation for Robust Visual Question Answering

This paper proposes a model-agnostic data augmentation (DA) strategy that can be seamlessly incorporated into any VQA architecture, together with a knowledge distillation based answer assignment that generates pseudo answers for all composed image-question pairs; the resulting models are robust in both in-domain and out-of-distribution settings.
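A minimal sketch of what a knowledge-distillation style answer assignment could look like, assuming a pretrained teacher VQA model scores each composed image-question pair and its predictions become pseudo answers. The names teacher_model, composed_pairs, and assign_pseudo_answers are hypothetical placeholders, not the paper's actual interfaces.

import torch

def assign_pseudo_answers(teacher_model, composed_pairs, temperature=2.0):
    # teacher_model(image_feat, question) is assumed to return answer logits;
    # composed_pairs is assumed to be an iterable of (image_feat, question) tuples.
    teacher_model.eval()
    pseudo_labeled = []
    with torch.no_grad():
        for image_feat, question in composed_pairs:
            logits = teacher_model(image_feat, question)
            soft_targets = torch.softmax(logits / temperature, dim=-1)  # distilled soft answer distribution
            hard_answer = int(soft_targets.argmax())                    # or keep the soft distribution as the target
            pseudo_labeled.append((image_feat, question, hard_answer, soft_targets))
    return pseudo_labeled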

From Images to Textual Prompts: Zero-shot VQA with Frozen Large Language Models

Img2Prompt is a plug-and-play module that provides prompts bridging the modality and task disconnections between images and large language models (LLMs), so that LLMs can perform zero-shot VQA without end-to-end training.
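A rough illustration of the general idea of answering visual questions with a frozen LLM via image-derived text, assuming a captioning model and a text-only generator from the Hugging Face transformers pipelines. This is only a sketch of the prompting concept; Img2Prompt's actual prompt construction (e.g., synthetic QA exemplars and answer extraction) is more involved, and the model choices below are placeholders.

from transformers import pipeline

captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
generator = pipeline("text-generation", model="gpt2")  # stands in for a larger frozen LLM

def zero_shot_vqa(image_path, question):
    # Turn the image into text, then let the frozen LLM answer from that text alone.
    caption = captioner(image_path)[0]["generated_text"]
    prompt = (
        f"Context: {caption}\n"
        f"Question: {question}\n"
        "Short answer:"
    )
    output = generator(prompt, max_new_tokens=10, do_sample=False)[0]["generated_text"]
    return output[len(prompt):].strip()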

All You May Need for VQA are Image Captions

This paper proposes a method that automatically derives VQA examples in volume by leveraging the abundance of existing image-caption annotations combined with neural models for textual question generation, and shows that the resulting data is of high quality.
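A sketch of deriving VQA training pairs from existing captions with a neural question-generation model, assuming a text2text-generation pipeline; the model name, its highlight-based input format, and vqa_pairs_from_caption are illustrative assumptions rather than the paper's actual pipeline.

from transformers import pipeline

question_generator = pipeline("text2text-generation",
                              model="valhalla/t5-small-qa-qg-hl")  # hypothetical model choice

def vqa_pairs_from_caption(image_id, caption, answer_span):
    # Mark the chosen answer span inside the caption and ask the model for a question,
    # yielding an (image, question, answer) training triple.
    highlighted = caption.replace(answer_span, f"<hl> {answer_span} <hl>")
    question = question_generator(f"generate question: {highlighted}",
                                  max_new_tokens=32)[0]["generated_text"]
    return {"image_id": image_id, "question": question, "answer": answer_span}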

Task Formulation Matters When Learning Continually: A Case Study in Visual Question Answering

It is suggested that incremental learning of language reasoning skills is more difficult than incrementally learning visual categories and is related to task similarity, where heterogeneous tasks lead to more severe forgetting.