Corpus ID: 239050084

Single-Modal Entropy based Active Learning for Visual Question Answering

Dong-Jin Kim, Jae Won Cho, Jinsoo Choi, Yunjae Jung, and In-So Kweon
Constructing a large-scale labeled dataset in the real world, especially for high-level tasks (e.g., Visual Question Answering), can be expensive and time-consuming. Moreover, with ever-growing data volumes and model complexity, Active Learning has become an important aspect of computer vision research. In this work, we address Active Learning in the multi-modal setting of Visual Question Answering (VQA). In light of the multi-modal inputs, image and question, we propose a…
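The abstract centers on entropy as an acquisition score for selecting unlabeled samples. As a minimal generic sketch (not the authors' implementation — the function names and the toy data below are illustrative assumptions), entropy-based sample selection over a model's predictive distributions could look like:

```python
import numpy as np

def predictive_entropy(probs):
    """Shannon entropy (in nats) of a batch of predictive distributions.

    probs: (N, C) array of softmax outputs over C answer classes.
    Returns an (N,) array of per-sample entropies.
    """
    eps = 1e-12  # guard against log(0)
    return -np.sum(probs * np.log(probs + eps), axis=1)

def select_for_labeling(probs, budget):
    """Return indices of the `budget` most-uncertain (highest-entropy) samples."""
    scores = predictive_entropy(probs)
    return np.argsort(-scores)[:budget]

# Toy usage: 4 unlabeled samples, 3 candidate answers
probs = np.array([
    [0.98, 0.01, 0.01],  # confident -> low entropy
    [0.34, 0.33, 0.33],  # near-uniform -> high entropy
    [0.70, 0.20, 0.10],
    [0.50, 0.50, 0.00],
])
picked = select_for_labeling(probs, budget=2)
```

In a single-modal variant, `probs` would come from a model that sees only one input modality (image or question), and the resulting scores would drive the next annotation round.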

Active Learning for Visual Question Answering: An Empirical Study
It is found that deep VQA models need large amounts of training data before they can start asking informative questions, and that the three active-learning approaches studied all outperform the random-selection baseline and achieve significant query savings.
Dealing with Missing Modalities in the Visual Question Answer-Difference Prediction Task through Knowledge Distillation
This work addresses the missing-modality issue arising in the Visual Question Answer-Difference prediction task and solves it with a privileged knowledge distillation scheme.
Visual Question Answering as a Meta Learning Task
This work adapts a state-of-the-art VQA model with two techniques from the recent meta learning literature, namely prototypical networks and meta networks, and produces qualitatively distinct results with higher recall of rare answers, and a better sample efficiency that allows training with little initial data.
Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering
This work balances the popular VQA dataset by collecting complementary images such that every question in this balanced dataset is associated with not just a single image, but rather a pair of similar images that result in two different answers to the question.
Visual7W: Grounded Question Answering in Images
Object-level grounding establishes a semantic link between textual descriptions and image regions, enabling a new type of QA with visual answers in addition to the textual answers used in previous work; a novel LSTM model with spatial attention is proposed to tackle the 7W QA tasks.
Disjoint Multi-task Learning Between Heterogeneous Human-Centric Tasks
This paper proposes a novel alternating directional optimization method that learns efficiently from heterogeneous single-task datasets, combining human action classification data with captioning data for human behavior learning.
Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding
This work extensively evaluates Multimodal Compact Bilinear pooling (MCB) on the visual question answering and grounding tasks and consistently shows the benefit of MCB over ablations without MCB.
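MCB approximates the outer product of two modality features by circularly convolving their Count Sketch projections in the frequency domain. A compact generic sketch of that idea (dimensions, seeds, and function names here are illustrative assumptions, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 128, 512  # input feature dim, sketch dim (illustrative sizes)

# Fixed random hash indices and signs, one pair per modality
h1 = rng.integers(0, d_out, d_in)
h2 = rng.integers(0, d_out, d_in)
s1 = rng.choice([-1, 1], d_in)
s2 = rng.choice([-1, 1], d_in)

def count_sketch(x, h, s):
    """Project x into d_out dims: each input coordinate adds s[i]*x[i] to slot h[i]."""
    y = np.zeros(d_out)
    np.add.at(y, h, s * x)  # unbuffered scatter-add (handles repeated slots)
    return y

def mcb(v, q):
    """Compact bilinear pooling: circular convolution of the two sketches,
    computed as an elementwise product in the FFT domain."""
    fv = np.fft.rfft(count_sketch(v, h1, s1))
    fq = np.fft.rfft(count_sketch(q, h2, s2))
    return np.fft.irfft(fv * fq, n=d_out)

# Usage: fuse a visual feature and a question feature
v = rng.normal(size=d_in)
q = rng.normal(size=d_in)
joint = mcb(v, q)
```

The resulting `joint` vector stands in for the full `d_in * d_in` bilinear interaction at a fraction of the memory cost, which is the property the paper exploits for VQA fusion.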
Hierarchical Question-Image Co-Attention for Visual Question Answering
This paper presents a novel co-attention model for VQA that jointly reasons about image and question attention in a hierarchical fashion via a novel 1-dimensional convolutional neural network (CNN).
RUBi: Reducing Unimodal Biases in Visual Question Answering
RUBi, a new learning strategy to reduce biases in any VQA model, is proposed, which reduces the importance of the most biased examples, i.e. examples that can be correctly classified without looking at the image.
Dual Adversarial Network for Deep Active Learning
This paper investigates the sample-overlap problem of recent uncertainty-based approaches and proposes a dual adversarial network, namely DAAL, which learns to select the most uncertain and representative data points in a single stage.