• Corpus ID: 244270027

Learning to Compose Visual Relations

Nan Liu, Shuang Li, Yilun Du, Joshua B. Tenenbaum, Antonio Torralba
The visual world around us can be described as a structured set of objects and their associated relations. An image of a room may be conjured given only the description of the underlying objects and their associated relations. While there has been significant work on designing deep neural networks which may compose individual objects together, less work has been done on composing the individual relations between objects. A principal difficulty is that while the placement of objects is mutually… 
Compositional Visual Generation with Composable Diffusion Models
The proposed method can generate scenes at test time that are substantially more complex than those seen in training, composing sentence descriptions, object relations, human facial attributes, and even generalizing to new combinations that are rarely seen in the real world.
"This is my unicorn, Fluffy": Personalizing frozen vision-language representations
This work introduces a new learning setup called Personalized Vision & Language (PerVL) with two new benchmark datasets for retrieving and segmenting user-specific (“personalized”) concepts “in the wild” and proposes an architecture for solving PerVL that operates by extending the input vocabulary of a pretrained model with new word embeddings for the new personalized concepts.
ReCLIP: A Strong Zero-Shot Baseline for Referring Expression Comprehension
The first component of ReCLIP is a region-scoring method that isolates object proposals via cropping and blurring, and passes them to CLIP, but it is found that CLIP is largely incapable of performing spatial reasoning off-the-shelf.
Zero-Shot Text-Guided Object Generation with Dream Fields
This work combines neural rendering with multi-modal image and text representations to synthesize diverse 3D objects solely from natural language descriptions, and introduces simple geometric priors, including sparsity-inducing transmittance regularization, scene bounds, and new MLP architectures.
DT2I: Dense Text-to-Image Generation from Region Descriptions
Dense text-to-image (DT2I) synthesis is introduced as a new task to pave the way toward more intuitive image generation, together with DTC-GAN, a novel method for generating images from semantically rich region descriptions, and a multi-modal region feature matching loss that encourages semantic image-text matching.
VAEL: Bridging Variational Autoencoders and Probabilistic Logic Programming
This work is the first to propose a general-purpose end-to-end framework integrating probabilistic logic programming into a deep generative model and provides support on the benefits of this neuro-symbolic integration both in terms of task generalization and data efficiency.


Discovering objects and their relations from entangled scene representations
It is shown that RNs are capable of learning object relations from scene description data and can act as a bottleneck that induces the factorization of objects from entangled scene description inputs, and from distributed deep representations of scene images provided by a variational autoencoder.
Exploiting Relationship for Complex-scene Image Generation
This work explores relationship-aware complex-scene image generation, where multiple objects are inter-related as a scene graph, and proposes three major updates to the generation framework; the resulting method significantly outperforms prior art in terms of IS and FID metrics.
Image Generation from Scene Graphs
This work proposes a method for generating images from scene graphs, enabling explicitly reasoning about objects and their relationships, and validates this approach on Visual Genome and COCO-Stuff.
Learning What and Where to Draw
This work proposes a new model, the Generative Adversarial What-Where Network (GAWWN), that synthesizes images given instructions describing what content to draw in which location, and shows high-quality 128 x 128 image synthesis on the Caltech-UCSD Birds dataset.
Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations
The Visual Genome dataset is presented, which contains over 108K images where each image has an average of 35 objects, 26 attributes, and 21 pairwise relationships between objects, and represents the densest and largest dataset of image descriptions, objects, attributes, relationships, and question answer pairs.
Compositional Visual Generation and Inference with Energy Based Models
This paper shows that energy-based models can exhibit compositional generation abilities by directly combining probability distributions, and demonstrates other unique advantages of the model, such as the ability to continually learn and incorporate new concepts, or infer compositions of concept properties underlying an image.
Learning Canonical Representations for Scene Graph to Image Generation
This work presents a novel model that addresses semantic equivalence issues in graphs by learning canonical graph representations from the data, resulting in improved image generation for complex visual scenes.
PasteGAN: A Semi-Parametric Method to Generate Image from Scene Graph
This work proposes a semi-parametric method, PasteGAN, for generating the image from the scene graph and the image crops, where spatial arrangements of the objects and their pair-wise relationships are defined by the scene graphs and the object appearances are determined by the given object crops.
Learning Transferable Visual Models From Natural Language Supervision
It is demonstrated that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet.
Using Scene Graph Context to Improve Image Generation
This paper introduces a scene graph context network that pools features generated by a graph convolutional neural network and provides them to both the image generation network and the adversarial loss. It also defines two novel evaluation metrics for this task, the relation score and the mean opinion relation score, which directly evaluate scene graph compliance.