VQA-LOL: Visual Question Answering under the Lens of Logic

@article{Gokhale2020VQALOLVQ,
  title={VQA-LOL: Visual Question Answering under the Lens of Logic},
  author={Tejas Gokhale and Pratyay Banerjee and Chitta Baral and Yezhou Yang},
  journal={ArXiv},
  year={2020},
  volume={abs/2002.08325}
}
Logical connectives and their implications on the meaning of a natural language sentence are a fundamental aspect of understanding. In this paper, we investigate whether visual question answering (VQA) systems trained to answer a question about an image are able to answer the logical composition of multiple such questions. When put under this "Lens of Logic", state-of-the-art VQA models have difficulty correctly answering these logically composed questions. We construct an…
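As a rough illustration of the kind of logical composition the paper studies, two yes/no questions about the same image can be combined with connectives and the composed answer derived from the individual answers. This is a hypothetical sketch, not the authors' code; the question templates and the answer logic are assumptions:

    # Illustrative sketch of composing yes/no VQA questions with logical
    # connectives (toy templates, not the VQA-LOL pipeline).
    def compose_questions(q1: str, a1: bool, q2: str, a2: bool) -> dict:
        """Build logically composed questions and derive their ground-truth answers."""
        strip = lambda q: q.rstrip("?")
        return {
            f"{strip(q1)} and {strip(q2).lower()}?": a1 and a2,        # conjunction
            f"{strip(q1)} or {strip(q2).lower()}?": a1 or a2,          # disjunction
            f"Is it not the case that {strip(q1).lower()}?": not a1,   # negation
        }

    composed = compose_questions("Is the man wearing a hat", True,
                                 "Is the dog sitting", False)
    for question, answer in composed.items():
        print(question, "->", "yes" if answer else "no")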
MLP Architectures for Vision-and-Language Modeling: An Empirical Study
TLDR
The first empirical study on the use of MLP architectures for vision-and-language (VL) fusion finds that without pre-training, using MLPs for multimodal fusion has a noticeable performance gap compared to transformers; however, VL pre-training can help close the performance gap; and suggests that MLPs can effectively learn to align vision and text features extracted from lower-level encoders without heavy reliance on self-attention.
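As context for what "MLPs for multimodal fusion" means in practice, here is a minimal sketch that fuses pooled vision and text features with a plain feed-forward network instead of transformer self-attention. The feature dimensions, layer sizes, and answer-vocabulary size are assumptions, not the paper's configuration:

    import torch
    import torch.nn as nn

    class MLPFusion(nn.Module):
        """Fuse pre-extracted vision and text features with a plain MLP
        (no self-attention over the multimodal sequence)."""
        def __init__(self, vis_dim=2048, txt_dim=768, hidden=1024, num_answers=3129):
            super().__init__()
            self.mlp = nn.Sequential(
                nn.Linear(vis_dim + txt_dim, hidden),
                nn.GELU(),
                nn.Linear(hidden, num_answers),
            )

        def forward(self, vis_feat, txt_feat):
            return self.mlp(torch.cat([vis_feat, txt_feat], dim=-1))

    # Example: pooled image and question features for a batch of 4.
    logits = MLPFusion()(torch.randn(4, 2048), torch.randn(4, 768))
    print(logits.shape)  # torch.Size([4, 3129])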
Adversarial VQA: A New Benchmark for Evaluating the Robustness of VQA Models
TLDR
Surprisingly, it is found that during dataset collection, non-expert annotators can easily attack SOTA VQA models successfully, revealing the fragility of these models while demonstrating the effectiveness of the adversarial dataset.
Discovering the Unknown Knowns: Turning Implicit Knowledge in the Dataset into Explicit Training Examples for Visual Question Answering
TLDR
It is found that many of the “unknowns” to the learned VQA model are indeed “known” in the dataset implicitly, and a simple data augmentation pipeline SIMPLEAUG is presented to turn this “known” knowledge into training examples for VQA.
HySTER: A Hybrid Spatio-Temporal Event Reasoner
TLDR
This work defines a method based on general temporal, causal, and physics rules which can be transferred across tasks, applies it to the CLEVRER dataset, and demonstrates state-of-the-art results in question-answering accuracy.
Semantically Distributed Robust Optimization for Vision-and-Language Inference
TLDR
SDRO is presented, a model-agnostic method that utilizes a set of linguistic transformations in a distributed robust optimization setting, along with an ensembling technique to leverage these transformations during inference.
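To make the robust-optimization idea concrete, a hedged sketch follows: the training loss is taken over a set of linguistic transformations of each sentence, here by optimizing against the worst-case transform. The transform set and the max-style aggregation are illustrative assumptions, not SDRO's exact formulation:

    import torch

    def worst_case_loss(model, image_feat, sentence, label, transforms, loss_fn):
        """Loss over linguistic transformations of `sentence`, taking the hardest one.

        `transforms` is a list of functions mapping a sentence to a perturbed
        sentence (e.g. negation, paraphrase, word substitution).
        """
        losses = []
        for t in [lambda s: s] + list(transforms):
            logits = model(image_feat, t(sentence))
            losses.append(loss_fn(logits, label))
        return torch.stack(losses).max()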
A Closer Look at the Robustness of Vision-and-Language Pre-trained Models
TLDR
This work proposes Mango, a generic and efficient approach that learns a Multimodal Adversarial Noise GeneratOr in the embedding space to fool pre-trained V+L models, and enables universal performance lift for pre-trained models over diverse tasks designed to evaluate broad aspects of robustness.
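A rough sketch of what an adversarial noise generator in the embedding space can look like; the generator architecture, the noise bound, and the training signal are assumptions rather than MANGO's actual design:

    import torch
    import torch.nn as nn

    class NoiseGenerator(nn.Module):
        """Generate a bounded perturbation for multimodal input embeddings."""
        def __init__(self, dim=768, eps=0.1):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(dim, dim), nn.Tanh())
            self.eps = eps

        def forward(self, embeddings):
            # Add bounded noise to token/region embeddings before the V+L model.
            return embeddings + self.eps * self.net(embeddings)

    # The generator would be trained to increase the victim model's loss, e.g.
    #   attack_loss = -cross_entropy(vl_model(generator(embeddings)), labels)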
WeaQA: Weak Supervision via Captions for Visual Question Answering
TLDR
This work presents a method to train models with synthetic Q-A pairs generated procedurally from captions, and demonstrates the efficacy of spatial-pyramid image patches as a simple but effective alternative to dense and costly object bounding box annotations used in existing VQA models.
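For a concrete picture of "spatial-pyramid image patches", here is a small sketch that tiles an image into 1x1, 2x2, and 3x3 grids of boxes in place of detected object regions; the pyramid levels are an assumption:

    def pyramid_patches(height, width, levels=(1, 2, 3)):
        """Return (x1, y1, x2, y2) boxes from a spatial pyramid of grid tilings."""
        boxes = []
        for n in levels:
            patch_h, patch_w = height / n, width / n
            for row in range(n):
                for col in range(n):
                    boxes.append((col * patch_w, row * patch_h,
                                  (col + 1) * patch_w, (row + 1) * patch_h))
        return boxes

    print(len(pyramid_patches(480, 640)))  # 1 + 4 + 9 = 14 patches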
Self-Supervised VQA: Answering Visual Questions using Images and Captions
TLDR
This work presents a method to train models with procedurally generated Q-A pairs from captions, using techniques such as templates and annotation frameworks like QASRL, which surpass prior supervised methods on VQA-CP and are competitive with methods without object features in the fully supervised setting.
RODA: Reverse Operation Based Data Augmentation for Solving Math Word Problems
TLDR
A novel data augmentation method is proposed that reverses the mathematical logic of math word problems to produce new high-quality math problems and introduce new knowledge points that can benefit learning the mathematical reasoning logic.
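A toy illustration of what reversing the mathematical logic of a word problem can mean: solve the forward problem, then restate it so that one original operand becomes the unknown. The problem text and templates are invented for illustration, not taken from RODA:

    # Toy reverse-operation augmentation (hypothetical templates, not RODA's code).
    # Forward:  "Tom had 3 apples and bought 5 more. How many apples now?"  -> 3 + 5 = 8
    # Reversed: "Tom had some apples, bought 5 more, and now has 8.
    #            How many did he start with?"                               -> 8 - 5 = 3
    def reverse_addition(a, b):
        total = a + b
        forward = (f"Tom had {a} apples and bought {b} more. How many apples now?", total)
        reversed_ = (f"Tom had some apples, bought {b} more, and now has {total}. "
                     f"How many did he start with?", total - b)
        return forward, reversed_

    print(reverse_addition(3, 5))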
Bilateral Cross-Modality Graph Matching Attention for Feature Fusion in Visual Question Answering
  • JianJian Cao, Xiameng Qin, Sanyuan Zhao, Jianbing Shen
  • Computer Science, ArXiv
  • 2021
TLDR
A Graph Matching Attention (GMA) network is presented that not only builds a graph for the image, but also constructs a graph for the question in terms of both syntactic and embedding information, and achieves state-of-the-art performance on the GQA dataset and the VQA 2.0 dataset.

References

Showing 1–10 of 68 references
VQA: Visual Question Answering
We propose the task of free-form and open-ended Visual Question Answering (VQA). Given an image and a natural language question about the image, the task is to provide an accurate natural language answer.
GloVe: Global Vectors for Word Representation
TLDR
A new global log-bilinear regression model that combines the advantages of the two major model families in the literature (global matrix factorization and local context window methods) and produces a vector space with meaningful substructure.
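For reference, the global log-bilinear objective in question is the standard GloVe loss, fit over the corpus word co-occurrence matrix X:

    J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2

where w_i and \tilde{w}_j are word and context vectors, b_i and \tilde{b}_j are biases, V is the vocabulary size, and f is a weighting function that limits the influence of very frequent co-occurrences.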
Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering
TLDR
This work balances the popular VQA dataset by collecting complementary images such that every question in this balanced dataset is associated with not just a single image, but rather a pair of similar images that result in two different answers to the question.
The Principle of Four-Cornered Negation in Indian Philosophy
  • The Review of Metaphysics, pp. 694–713
  • 1954
Ethics, translated by Andrew Boyle, introduction by T. S. Gregory
  • 1934
Logic-Guided Data Augmentation and Regularization for Consistent Question Answering
TLDR
This paper addresses the problem of improving the accuracy and consistency of responses to comparison questions by integrating logic rules and neural models: it leverages logical and linguistic knowledge to augment labeled training data and then uses a consistency-based regularizer to train the model.
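As a minimal sketch of a consistency-based regularizer of this kind, one can penalize predictions that violate a logical rule relating a question and its counterpart, e.g. that a question and its negation cannot both be answered "yes". The specific rule and hinge-style penalty below are assumptions, not the paper's exact loss:

    import torch
    import torch.nn.functional as F

    def consistency_loss(p_yes_original, p_yes_negated):
        """Penalize predicting 'yes' for both a question and its logical negation.

        Both arguments are predicted probabilities of the answer 'yes'; the rule
        p_original + p_negated <= 1 is enforced with a hinge penalty.
        """
        return F.relu(p_yes_original + p_yes_negated - 1.0).mean()

    print(consistency_loss(torch.tensor([0.9]), torch.tensor([0.8])))  # positive penalty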
Unified Vision-Language Pre-Training for Image Captioning and VQA
TLDR
VLP is the first reported model that achieves state-of-the-art results on both vision-language generation and understanding tasks, as disparate as image captioning and visual question answering, across three challenging benchmark datasets: COCO Captions, Flickr30k Captions, and VQA 2.0.
Video2Commonsense: Generating Commonsense Descriptions to Enrich Video Captioning
TLDR
This is the first work on generating commonsense captions directly from videos, in order to describe latent aspects such as intentions, attributes, and effects; the commonsense generation models are also fine-tuned on the V2C-QA task, where questions are asked about the latent aspects in the video.
A Corpus for Reasoning about Natural Language Grounded in Photographs
TLDR
This work introduces a new dataset for joint reasoning about natural language and images, with a focus on semantic diversity, compositionality, and visual reasoning challenges; evaluation using state-of-the-art visual reasoning methods shows the data presents a strong challenge.