DT2I: Dense Text-to-Image Generation from Region Descriptions
@article{Frolov2022DT2IDT,
  title   = {DT2I: Dense Text-to-Image Generation from Region Descriptions},
  author  = {Stanislav Frolov and Prateek Bansal and J{\"o}rn Hees and Andreas R. Dengel},
  journal = {ArXiv},
  year    = {2022},
  volume  = {abs/2204.02035}
}
Despite astonishing progress, generating realistic images of complex scenes remains a challenging problem. Recently, layout-to-image synthesis approaches have attracted much interest by conditioning the generator on a list of bounding boxes and corresponding class labels. However, previous approaches are very restrictive because the set of labels is fixed a priori. Meanwhile, text-to-image synthesis methods have substantially improved and provide a flexible way for conditional image generation…
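The distinction the abstract draws can be made concrete with two input schemas: layout-to-image methods condition on boxes with labels from a closed vocabulary, while dense text-to-image conditions each region on free-form text. The following is a minimal illustrative sketch; the class names and fields are hypothetical, not taken from the paper.

```python
from dataclasses import dataclass
from typing import Tuple

# Hypothetical data structures contrasting the two conditioning schemes.
# Names and fields are illustrative only.

@dataclass
class LabeledBox:
    """Layout-to-image input: a box plus a label from a fixed set."""
    bbox: Tuple[float, float, float, float]  # (x, y, w, h), normalized to [0, 1]
    class_id: int                            # index into a fixed, a-priori label set

@dataclass
class DescribedRegion:
    """Dense text-to-image input: a box plus a free-form description."""
    bbox: Tuple[float, float, float, float]
    description: str                         # arbitrary natural language

# A fixed-vocabulary layout vs. a region-description layout for the same box:
layout = [LabeledBox(bbox=(0.1, 0.2, 0.3, 0.4), class_id=7)]
dense = [DescribedRegion(bbox=(0.1, 0.2, 0.3, 0.4),
                         description="a fluffy white cat sleeping")]
```

The point of the contrast: the first scheme cannot express anything outside its label set, while the second accepts any phrase per region.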
References
Showing 1–10 of 46 references.
Learning Layout and Style Reconfigurable GANs for Controllable Image Synthesis
- Computer Science, IEEE Transactions on Pattern Analysis and Machine Intelligence
- 2021
Proposes an intuitive layout-to-mask-to-image paradigm for the task, which learns to unfold object masks in a weakly supervised way from an input layout and object style codes, and presents a method built on Generative Adversarial Networks (GANs).
AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Adversarial Networks
- Computer Science, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition
- 2018
Proposes an Attentional Generative Adversarial Network that allows attention-driven, multi-stage refinement for fine-grained text-to-image generation, and shows for the first time that the layered attentional GAN can automatically select word-level conditions for generating different parts of the image.
Image Synthesis From Reconfigurable Layout and Style
- Computer Science, 2019 IEEE/CVF International Conference on Computer Vision (ICCV)
- 2019
This paper presents a layout- and style-based architecture for generative adversarial networks (termed LostGANs) that can be trained end-to-end to generate images from reconfigurable layout and style.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
- Computer Science, NAACL
- 2019
Introduces BERT, a language representation model designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can then be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
Generative Adversarial Text to Image Synthesis
- Computer Science, ICML
- 2016
Develops a novel deep architecture and GAN formulation that effectively bridges advances in text and image modeling, translating visual concepts from characters to pixels.
Adam: A Method for Stochastic Optimization
- Computer Science, ICLR
- 2015
Introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate comparable to the best known results in the online convex optimization framework.
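The "adaptive estimates of lower-order moments" in this summary refer to exponential moving averages of the gradient and its square. A minimal single-parameter sketch of the Adam update follows; the function name and default hyperparameters are illustrative, not the paper's implementation.

```python
import math

def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter.

    m, v are the running first- and second-moment estimates; t is the
    1-indexed step count used for bias correction.
    """
    m = beta1 * m + (1 - beta1) * grad          # first moment (mean of gradients)
    v = beta2 * v + (1 - beta2) * grad * grad   # second moment (mean of squared gradients)
    m_hat = m / (1 - beta1 ** t)                # bias-corrected estimates
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

# Usage sketch: minimize f(x) = x^2 (gradient 2x) starting from x = 5.
x, m, v = 5.0, 0.0, 0.0
for t in range(1, 201):
    x, m, v = adam_step(x, 2.0 * x, m, v, t, lr=0.1)
```

Because the step size is normalized by the square root of the second-moment estimate, early updates move at roughly the learning rate regardless of gradient scale.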
AttrLostGAN: Attribute Controlled Image Synthesis from Reconfigurable Layout and Style
- Computer Science, Lecture Notes in Computer Science
- 2021
This paper extends a state-of-the-art approach for layout-to-image generation to additionally condition individual objects on attributes and shows that the method can successfully control the fine-grained details of individual objects when modelling complex scenes with multiple objects.
Image Synthesis from Layout with Locality-Aware Mask Adaption
- Computer Science, 2021 IEEE/CVF International Conference on Computer Vision (ICCV)
- 2021
Experimental results show that the proposed model with LAMA outperforms existing approaches in visual fidelity and alignment with input layouts, improving the state-of-the-art FID score from 41.65 to 31.12 and the SceneFID from 22.00 to 18.64.
Learning to Compose Visual Relations
- Computer Science, NeurIPS
- 2021
This work proposes to represent each relation as an unnormalized density (an energy-based model), enabling us to compose separate relations in a factorized manner, and shows that decomposition enables the model to effectively understand the underlying relational scene structure.