Corpus ID: 237503047

Broaden the Vision: Geo-Diverse Visual Commonsense Reasoning

  • Da Yin, Liunian Harold Li, Ziniu Hu, Nanyun Peng, Kai-Wei Chang
  • Published 14 September 2021
  • Computer Science
  • ArXiv
Commonsense is defined as the knowledge that is shared by everyone. However, certain types of commonsense knowledge are correlated with culture and geographic locations and they are only shared locally. For example, the scenarios of wedding ceremonies vary across regions due to different customs influenced by historical and religious factors. Such regional characteristics, however, are generally omitted in prior work. In this paper, we construct a Geo-Diverse Visual Commonsense Reasoning…

From Recognition to Cognition: Visual Commonsense Reasoning
To move towards cognition-level understanding, a new reasoning engine is presented, Recognition to Cognition Networks (R2C), that models the necessary layered inferences for grounding, contextualization, and reasoning.
Cosmos QA: Machine Reading Comprehension with Contextual Commonsense Reasoning
This paper introduces Cosmos QA, a large-scale dataset of 35,600 problems that require commonsense-based reading comprehension, formulated as multiple-choice questions, and proposes a new architecture that improves over the competitive baselines.
CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge
This work presents CommonsenseQA: a challenging new dataset for commonsense question answering, which extracts from ConceptNet multiple target concepts that have the same semantic relation to a single source concept.
Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering
This work balances the popular VQA dataset by collecting complementary images such that every question in this balanced dataset is associated with not just a single image, but rather a pair of similar images that result in two different answers to the question.
PIQA: Reasoning about Physical Commonsense in Natural Language
The task of physical commonsense reasoning and a corresponding benchmark dataset, Physical Interaction: Question Answering (PIQA), are introduced, and an analysis of the dimensions of knowledge that existing models lack is provided, which offers significant opportunities for future research.
SWAG: A Large-Scale Adversarial Dataset for Grounded Commonsense Inference
This paper introduces the task of grounded commonsense inference, unifying natural language inference and commonsense reasoning, and proposes Adversarial Filtering (AF), a novel procedure that constructs a de-biased dataset by iteratively training an ensemble of stylistic classifiers, and using them to filter the data.
Social IQA: Commonsense Reasoning about Social Interactions
It is established that Social IQa, the first large-scale benchmark for commonsense reasoning about social situations, is challenging for existing question-answering models based on pretrained language models, compared to human performance (>20% gap).
VisualBERT: A Simple and Performant Baseline for Vision and Language
Analysis demonstrates that VisualBERT can ground elements of language to image regions without any explicit supervision and is even sensitive to syntactic relationships, tracking, for example, associations between verbs and image regions corresponding to their arguments.
Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models
This paper presents Flickr30K Entities, which augments the 158k captions from Flickr30k with 244k coreference chains, linking mentions of the same entities across different captions for the same image, and associating them with 276k manually annotated bounding boxes.
A Corpus for Reasoning about Natural Language Grounded in Photographs
This work introduces a new dataset for joint reasoning about natural language and images, with a focus on semantic diversity, compositionality, and visual reasoning challenges. Evaluation using state-of-the-art visual reasoning methods shows the data presents a strong challenge.