Abstract Visual Reasoning with Tangram Shapes

Anya Ji, Noriyuki Kojima, Noah Rush, Alane Suhr, Wai Keen Vong, Robert D. Hawkins, and Yoav Artzi. In Conference on Empirical Methods in Natural Language Processing.
We introduce KiloGram, a resource for studying abstract visual reasoning in humans and machines. Drawing on the history of tangram puzzles as stimuli in cognitive science, we build a richly annotated dataset that, with >1k distinct stimuli, is orders of magnitude larger and more diverse than prior resources. It is both visually and linguistically richer, moving beyond whole shape descriptions to include segmentation maps and part labels. We use this resource to evaluate the abstract visual… 

Do language models have coherent mental models of everyday things?

A simple extension to pre-trained language models like GPT-3 and Macaw is proposed, in which a constraint satisfaction layer is applied on top of the raw LM predictions to produce more consistent and accurate mental models of the parts of everyday things.



Natural Reference to Objects in a Visual Domain

This paper presents a study designed to elicit naturalistic referring expressions for relatively complex objects, and finds aspects of reference that have not been accounted for in work on Referring Expression Generation (REG).

A Corpus for Reasoning about Natural Language Grounded in Photographs

This work introduces a new dataset for joint reasoning about natural language and images, with a focus on semantic diversity, compositionality, and visual reasoning challenges. Evaluation using state-of-the-art visual reasoning methods shows the data presents a strong challenge.

ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks

We present ViLBERT (short for Vision-and-Language BERT), a model for learning task-agnostic joint representations of image content and natural language. We extend the popular BERT architecture to a…

Common object representations for visual recognition and production

It is found that repeatedly sketched objects were better recognized after training, while recognition of sketches of unpracticed but similar objects worsened, showing that visual production can reshape the representational space for objects by differentiating trained objects and merging nearby untrained objects in the space.

A Corpus of Natural Language for Visual Reasoning

This work presents a method for crowdsourcing linguistically diverse data; an analysis of the data demonstrates a broad set of linguistic phenomena requiring visual and set-theoretic reasoning.

Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality

Winoground, a novel task and dataset for evaluating the ability of vision and language models to conduct visio-linguistic compositional reasoning, is introduced; surprisingly, none of the evaluated models do much better than chance.

Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations

The Visual Genome dataset is presented, which contains over 108K images where each image has an average of 35 objects, 26 attributes, and 21 pairwise relationships between objects, and represents the densest and largest dataset of image descriptions, objects, attributes, relationships, and question answer pairs.

Learning to communicate about shared procedural abstractions

The results shed light on the inductive biases that enable intelligent agents to coordinate on shared procedural abstractions, and suggest that concepts may be represented by structured programs written in a domain-specific language (DSL).

Generation and Comprehension of Unambiguous Object Descriptions

This work proposes a method that can generate an unambiguous description of a specific object or region in an image, and can also comprehend such a description to infer which object is being referred to. The method outperforms previous approaches that generate object descriptions without taking into account other potentially ambiguous objects in the scene.

MultiPic: A standardized set of 750 drawings with norms for six European languages

MultiPic, a new set of 750 colored pictures of concrete concepts, constitutes a valuable tool for cognitive scientists investigating language, visual perception, memory, and/or attention in monolingual or multilingual populations.