Image Generation from Scene Graphs

@article{Johnson2018ImageGF,
  title={Image Generation from Scene Graphs},
  author={Justin Johnson and Agrim Gupta and Li Fei-Fei},
  journal={2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2018},
  pages={1219-1228}
}
To truly understand the visual world, our models should be able not only to recognize images but also to generate them. […] Our model uses graph convolution to process input graphs, computes a scene layout by predicting bounding boxes and segmentation masks for objects, and converts the layout to an image with a cascaded refinement network. The network is trained adversarially against a pair of discriminators to ensure realistic outputs. We validate our approach on Visual Genome and COCO-Stuff, where…
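The pipeline described above lends itself to a compact sketch. The following is a minimal, illustrative PyTorch version of the graph-convolution and layout stages only, assuming a single message-passing step, a hard-coded mask resolution, and made-up module names (SceneGraphToLayout, triple_mlp); the cascaded refinement network and the two discriminators are omitted, and none of this is the authors' released code.

import torch
import torch.nn as nn

class SceneGraphToLayout(nn.Module):
    # Graph convolution over (subject, predicate, object) triples, followed
    # by per-object box and mask prediction that together define the layout.
    def __init__(self, num_objs, num_preds, d=128, mask_size=16):
        super().__init__()
        self.obj_emb = nn.Embedding(num_objs, d)
        self.pred_emb = nn.Embedding(num_preds, d)
        # One message-passing step: each triple is mixed by an MLP and the
        # result is scattered back onto its subject and object nodes.
        self.triple_mlp = nn.Sequential(nn.Linear(3 * d, 3 * d), nn.ReLU())
        self.box_head = nn.Linear(d, 4)                   # (x0, y0, x1, y1) in [0, 1]
        self.mask_head = nn.Linear(d, mask_size * mask_size)
        self.mask_size = mask_size

    def forward(self, objs, triples):
        # objs: (O,) category ids; triples: (T, 3) = (subj index, pred id, obj index)
        v = self.obj_emb(objs)                            # (O, d) node vectors
        s, p, o = triples[:, 0], triples[:, 1], triples[:, 2]
        t = self.triple_mlp(torch.cat([v[s], self.pred_emb(p), v[o]], dim=1))
        vs, _, vo = t.chunk(3, dim=1)                     # predicate updates dropped in this sketch
        v = v.index_add(0, s, vs).index_add(0, o, vo)     # aggregate messages per node
        boxes = torch.sigmoid(self.box_head(v))           # (O, 4) bounding boxes
        masks = torch.sigmoid(self.mask_head(v)).view(-1, self.mask_size, self.mask_size)
        return boxes, masks

# Example: three objects, two relationships (all ids are arbitrary).
model = SceneGraphToLayout(num_objs=10, num_preds=5)
boxes, masks = model(torch.tensor([0, 3, 7]), torch.tensor([[1, 2, 2], [0, 0, 1]]))

In the full model, the predicted boxes and masks are rendered into a coarse semantic canvas, which the cascaded refinement network upsamples into an RGB image judged by image-level and object-level discriminators.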
Interactive Image Generation Using Scene Graphs
TLDR
This work proposes a method to generate an image incrementally from a sequence of scene-graph descriptions, preserving the image content generated in previous steps and modifying the cumulative image according to the newly provided scene information.
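At a high level, the incremental procedure is a loop in which each step conditions on the image accumulated so far plus the newly added graph. A minimal sketch, assuming a hypothetical generator(image, graph) interface:

import torch

def generate_incrementally(generator, graph_sequence, img_size=64):
    image = torch.zeros(1, 3, img_size, img_size)   # start from a blank canvas
    for graph in graph_sequence:
        # Each call must preserve previously generated content while adding
        # whatever the new scene-graph increment describes.
        image = generator(image, graph)
    return image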
Scene Graph Generation for Better Image Captioning?
TLDR
This work proposes a model that leverages detected objects and auto-generated visual relationships to describe images in natural language, and shows that it outperforms existing state-of-the-art end-to-end models that generate image descriptions directly from raw input pixels.
Using Scene Graph Context to Improve Image Generation
TLDR
This paper introduces a scene graph context network that pools features generated by a graph convolutional neural network and provides them to both the image generation network and the adversarial loss; it also defines two novel evaluation metrics for this task, the relation score and the mean opinion relation score, which directly evaluate scene graph compliance.
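The pooling idea can be sketched in a few lines: per-object features from the graph convolutional network are collapsed into one scene-level context vector, which is then fed to both the generator and the discriminator. The projection layer and mean pooling below are assumptions for illustration, not necessarily the paper's exact design.

import torch
import torch.nn as nn

class SceneGraphContext(nn.Module):
    # Pools per-object GCN features into a single scene-level context vector.
    def __init__(self, d=128, d_ctx=128):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(d, d_ctx), nn.ReLU())

    def forward(self, obj_feats):                  # obj_feats: (O, d)
        return self.proj(obj_feats).mean(dim=0)    # (d_ctx,) pooled context

# The same context vector is concatenated to the generator input and to the
# discriminator features, so both see global scene-graph information.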
Heuristics for Image Generation from Scene Graphs
TLDR
This paper uses visual heuristics to augment relationships between pairs of objects and introduces a graph convolution-based network to generate a scene graph context representation that enriches the image generation.
Learning Canonical Representations for Scene Graph to Image Generation
TLDR
This work presents a novel model that addresses semantic equivalence issues in graphs by learning canonical graph representations from the data, resulting in improved image generation for complex visual scenes.
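To see what canonicalization buys, consider a rule-based toy version: semantically equivalent triples such as (sky, above, sea) and (sea, below, sky) are mapped to one canonical form. The paper learns these equivalences from data; the hand-written INVERSE table below is purely illustrative.

INVERSE = {"right of": "left of", "above": "below", "in front of": "behind"}

def canonicalize(triples):
    # triples: list of (subject, predicate, object) strings.
    out = []
    for s, p, o in triples:
        if p in INVERSE:                  # flip to the canonical direction
            s, p, o = o, INVERSE[p], s
        out.append((s, p, o))
    return sorted(out)                    # order-independent representation

assert canonicalize([("sky", "above", "sea")]) == canonicalize([("sea", "below", "sky")])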
Transforming Image Generation from Scene Graphs
TLDR
A transformer-based approach conditioned on scene graphs that employs a decoder to autoregressively compose images, making the synthesis process more effective and controllable; results show that the model is able to satisfy semantic constraints defined by a scene graph.
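A hedged sketch of such a decoder, assuming a discrete image-token vocabulary from a separate tokenizer (e.g. a VQ autoencoder) and cross-attention into scene-graph node embeddings; all sizes and names are illustrative.

import torch
import torch.nn as nn

class GraphConditionedImageDecoder(nn.Module):
    # Autoregressive transformer over image tokens, conditioned on a scene
    # graph through cross-attention to its node embeddings.
    def __init__(self, vocab=1024, d=256, n_layers=4, max_len=256):
        super().__init__()
        self.tok = nn.Embedding(vocab, d)
        self.pos = nn.Embedding(max_len, d)
        layer = nn.TransformerDecoderLayer(d_model=d, nhead=8, batch_first=True)
        self.dec = nn.TransformerDecoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d, vocab)

    def forward(self, img_tokens, graph_memory):
        # img_tokens: (B, L) tokens so far; graph_memory: (B, O, d) node embeddings
        L = img_tokens.shape[1]
        pos = torch.arange(L, device=img_tokens.device)
        x = self.tok(img_tokens) + self.pos(pos)
        causal = nn.Transformer.generate_square_subsequent_mask(L).to(img_tokens.device)
        h = self.dec(x, graph_memory, tgt_mask=causal)   # can only look backwards
        return self.head(h)                              # next-token logits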
Learning to Generate Scene Graph from Natural Language Supervision
TLDR
This paper proposes one of the first methods to learn, from image-sentence pairs, a graphical representation of localized objects and their relationships within an image, known as a scene graph, and designs a Transformer-based model to predict these "pseudo" labels via a masked token prediction task.
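The masked-token objective itself is generic and easy to sketch; here model is any network mapping a token sequence to per-position logits (its signature, the mask rate, and mask_id are assumptions).

import torch
import torch.nn.functional as F

def masked_token_loss(model, tokens, mask_id, p=0.15):
    # Replace a random subset of tokens with [MASK] and train the model to
    # recover them; only masked positions contribute to the loss.
    tokens = tokens.clone()
    mask = torch.rand_like(tokens, dtype=torch.float) < p
    targets = tokens[mask]
    tokens[mask] = mask_id
    logits = model(tokens)                # (B, L, V) per-position predictions
    return F.cross_entropy(logits[mask], targets)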
Text Pared into Scene Graph for Diverse Image Generation
TLDR
A module that parses text descriptions into scene graphs is proposed; it can generate reasonable scene layouts to ensure that the generated images and objects are realistic, and it enhances the interaction between objects and global semantics by concatenating each object embedding with the text embedding.
Exploiting Relationship for Complex-scene Image Generation
TLDR
This work explores relationship-aware complex-scene image generation, where multiple objects are inter-related as a scene graph, and proposes three major updates to the generation framework, which significantly outperforms prior approaches in terms of IS and FID metrics.
A Case for Object Compositionality in Deep Generative Models of Images
TLDR
This work proposes to structure the generator of a GAN to consider objects and their relations explicitly, and generate images by means of composition, which provides a way to efficiently learn a more accurate generative model of real-world images, and serves as an initial step towards learning corresponding object representations.
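One concrete way to "generate by composition" is to give each object its own generator head emitting an RGBA layer, then alpha-composite the layers over a generated background. The MLP heads and fixed object count below are simplifying assumptions.

import torch
import torch.nn as nn

class CompositionalGenerator(nn.Module):
    # K object generators emit RGBA layers composited over a background,
    # so objects are represented explicitly rather than entangled.
    def __init__(self, z_dim=64, k=3, size=32):
        super().__init__()
        self.k, self.size = k, size
        self.obj_gens = nn.ModuleList(
            nn.Linear(z_dim, 4 * size * size) for _ in range(k))
        self.bg_gen = nn.Linear(z_dim, 3 * size * size)

    def forward(self, z):                  # z: (B, k + 1, z_dim), one latent per slot
        B = z.shape[0]
        img = torch.sigmoid(self.bg_gen(z[:, -1])).view(B, 3, self.size, self.size)
        for i, g in enumerate(self.obj_gens):
            rgba = g(z[:, i]).view(B, 4, self.size, self.size)
            rgb, alpha = torch.sigmoid(rgba[:, :3]), torch.sigmoid(rgba[:, 3:])
            img = alpha * rgb + (1 - alpha) * img   # paint layer i on top
        return img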
…

References

Showing 1–10 of 65 references
Learning What and Where to Draw
TLDR
This work proposes a new model, the Generative Adversarial What-Where Network (GAWWN), that synthesizes images given instructions describing what content to draw in which location, and shows high-quality 128 x 128 image synthesis on the Caltech-UCSD Birds dataset.
Image-to-Image Translation with Conditional Adversarial Networks
TLDR
Conditional adversarial networks are investigated as a general-purpose solution to image-to-image translation problems and it is demonstrated that this approach is effective at synthesizing photos from label maps, reconstructing objects from edge maps, and colorizing images, among other tasks.
Pixels to Graphs by Associative Embedding
TLDR
A method for training a convolutional neural network to take in an input image and produce a full graph definition, end-to-end in a single stage, using associative embeddings.
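The grouping step behind associative embeddings can be sketched directly: every predicted edge carries two embedding vectors, and it attaches to the vertices whose embeddings are nearest. Shapes and names below are assumptions.

import torch

def attach_edges(vertex_emb, edge_src_emb, edge_dst_emb):
    # vertex_emb: (V, d); edge_src_emb, edge_dst_emb: (E, d).
    d_src = torch.cdist(edge_src_emb, vertex_emb)    # (E, V) pairwise distances
    d_dst = torch.cdist(edge_dst_emb, vertex_emb)
    # Each edge endpoint snaps to its nearest vertex embedding.
    return d_src.argmin(dim=1), d_dst.argmin(dim=1)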
Conditional Image Generation with PixelCNN Decoders
TLDR
The gated convolutional layers in the proposed model improve the log-likelihood of PixelCNN to match the state-of-the-art performance of PixelRNN on ImageNet, with greatly reduced computational cost.
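The gating itself is the well-known activation y = tanh(W_f * x) ⊙ σ(W_g * x), with a conditioning vector h added into both branches. A sketch follows; the kernel size and the omitted autoregressive masking are simplifications.

import torch
import torch.nn as nn

class GatedConv(nn.Module):
    # Conditional gated activation: tanh branch times sigmoid gate, each
    # shifted by a projection of the conditioning vector h.
    def __init__(self, channels, cond_dim):
        super().__init__()
        self.conv_f = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv_g = nn.Conv2d(channels, channels, 3, padding=1)
        self.cond_f = nn.Linear(cond_dim, channels)
        self.cond_g = nn.Linear(cond_dim, channels)

    def forward(self, x, h):               # x: (B, C, H, W); h: (B, cond_dim)
        f = self.conv_f(x) + self.cond_f(h)[:, :, None, None]
        g = self.conv_g(x) + self.cond_g(h)[:, :, None, None]
        return torch.tanh(f) * torch.sigmoid(g)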
StackGAN: Text to Photo-Realistic Image Synthesis with Stacked Generative Adversarial Networks
TLDR
This paper proposes Stacked Generative Adversarial Networks (StackGAN) to generate 256×256 photo-realistic images conditioned on text descriptions and introduces a novel Conditioning Augmentation technique that encourages smoothness in the latent conditioning manifold.
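Conditioning Augmentation is simple to sketch: predict a Gaussian around the text embedding, sample from it with the reparameterization trick, and regularize toward N(0, I). The dimensions below are assumptions.

import torch
import torch.nn as nn

class ConditioningAugmentation(nn.Module):
    # Samples the conditioning vector c ~ N(mu(phi), sigma(phi)^2) from the
    # text embedding phi, returning a KL penalty that keeps the manifold smooth.
    def __init__(self, text_dim=1024, c_dim=128):
        super().__init__()
        self.fc = nn.Linear(text_dim, 2 * c_dim)

    def forward(self, phi):
        mu, logvar = self.fc(phi).chunk(2, dim=1)
        c = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)   # reparameterize
        kl = 0.5 * torch.mean(mu.pow(2) + logvar.exp() - 1 - logvar)
        return c, kl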
Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations
TLDR
The Visual Genome dataset is presented, which contains over 108K images where each image has an average of 35 objects, 26 attributes, and 21 pairwise relationships between objects, and represents the densest and largest dataset of image descriptions, objects, attributes, relationships, and question answer pairs.
Photographic Image Synthesis with Cascaded Refinement Networks
  Qifeng Chen, V. Koltun · 2017 IEEE International Conference on Computer Vision (ICCV)
TLDR
It is shown that photographic images can be synthesized from semantic layouts by a single feedforward network with appropriate structure, trained end-to-end with a direct regression objective.
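The network is a cascade of refinement modules, each doubling the working resolution; below is a sketch of one module, with normalization and channel counts simplified relative to the original design.

import torch
import torch.nn as nn
import torch.nn.functional as F

class RefinementModule(nn.Module):
    # Upsample previous features, concatenate the semantic layout resized to
    # the new resolution, and refine with two convolutions.
    def __init__(self, layout_ch, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(layout_ch + in_ch, out_ch, 3, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.LeakyReLU(0.2))

    def forward(self, feats, layout):
        feats = F.interpolate(feats, scale_factor=2, mode="nearest")
        layout = F.interpolate(layout, size=feats.shape[-2:], mode="nearest")
        return self.conv(torch.cat([layout, feats], dim=1))

# A full cascade stacks these from a coarse start (e.g. 4x4) up to the output
# resolution, then maps the final features to RGB with a 1x1 convolution.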
Improved Techniques for Training GANs
TLDR
This work focuses on two applications of GANs, semi-supervised learning and the generation of images that humans find visually realistic; it presents ImageNet samples with unprecedented resolution and shows that the methods enable the model to learn recognizable features of ImageNet classes.
Generating Semantically Precise Scene Graphs from Textual Descriptions for Improved Image Retrieval
TLDR
It is shown that scene graphs can be effectively created automatically from a natural language scene description and that using the output of the parsers is almost as effective as using human-constructed scene graphs.
Perceptual Losses for Real-Time Style Transfer and Super-Resolution
TLDR
This work considers image transformation problems and proposes the use of perceptual loss functions for training feed-forward networks for such tasks, showing results on image style transfer, where a feed-forward network is trained to solve, in real time, the optimization problem proposed by Gatys et al.
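A perceptual (feature-reconstruction) loss is short to write down; the layer cut-off below (relu2_2 of VGG16) is one common choice and an assumption, not necessarily the paper's exact configuration.

import torch
import torch.nn as nn
import torchvision.models as models

class PerceptualLoss(nn.Module):
    # L2 distance between fixed VGG16 activations of output and target.
    def __init__(self):
        super().__init__()
        vgg = models.vgg16(weights=models.VGG16_Weights.DEFAULT).features[:9]
        for p in vgg.parameters():
            p.requires_grad = False        # the loss network is never trained
        self.vgg = vgg.eval()

    def forward(self, out, target):
        return torch.mean((self.vgg(out) - self.vgg(target)) ** 2)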
…