DT2I: Dense Text-to-Image Generation from Region Descriptions

@article{Frolov2022DT2IDT,
  title={DT2I: Dense Text-to-Image Generation from Region Descriptions},
  author={Stanislav Frolov and Prateek Bansal and J{\"o}rn Hees and Andreas R. Dengel},
  journal={ArXiv},
  year={2022},
  volume={abs/2204.02035}
}
Despite astonishing progress, generating realistic images of complex scenes remains a challenging problem. Recently, layout-to-image synthesis approaches have attracted much interest by conditioning the generator on a list of bounding boxes and corresponding class labels. However, previous approaches are very restrictive because the set of labels is fixed a priori. Meanwhile, text-to-image synthesis methods have substantially improved and provide a flexible way for conditional image generation…
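The key departure from prior layout-to-image work is that each bounding box carries a free-form region description instead of a label from a fixed set. As a rough illustration of what region-description conditioning can look like, here is a minimal PyTorch sketch with hypothetical names and shapes; it does not reproduce the paper's actual architecture:

import torch
import torch.nn as nn

class RegionLayoutEncoder(nn.Module):
    """Hypothetical sketch (not the DT2I architecture): scatter one text
    embedding per region description into a spatial map that a
    layout-conditioned generator could consume."""

    def __init__(self, text_dim=256, map_size=64):
        super().__init__()
        self.map_size = map_size
        self.proj = nn.Linear(text_dim, text_dim)

    def forward(self, region_embs, boxes):
        # region_embs: (num_regions, text_dim) tensor, one embedding per
        # free-form region description (from any sentence encoder).
        # boxes: (num_regions, 4) tensor of normalized (x0, y0, x1, y1).
        layout = torch.zeros(region_embs.size(1), self.map_size, self.map_size)
        for emb, box in zip(self.proj(region_embs), boxes):
            x0, y0, x1, y1 = (box * self.map_size).long().tolist()
            # Broadcast the region's embedding into its bounding box.
            layout[:, y0:max(y1, y0 + 1), x0:max(x1, x0 + 1)] += emb[:, None, None]
        return layout  # (text_dim, H, W): spatial conditioning for a generator

The region embeddings could come from any pretrained sentence encoder; the generator itself would consume the resulting spatial map alongside a noise vector.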


References

Showing 1-10 of 46 references
Learning Layout and Style Reconfigurable GANs for Controllable Image Synthesis
Wei Sun, Tianfu Wu • IEEE Transactions on Pattern Analysis and Machine Intelligence • 2021
TLDR: Proposes an intuitive layout-to-mask-to-image paradigm that learns to unfold object masks in a weakly supervised way from an input layout and object style codes, and presents a method built on Generative Adversarial Networks (GANs).
AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Adversarial Networks
TLDR: Proposes an Attentional Generative Adversarial Network that performs attention-driven, multi-stage refinement for fine-grained text-to-image generation, and shows for the first time that a layered attentional GAN can automatically select word-level conditions for generating different parts of the image.
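For context, the word-level attention at the heart of AttnGAN can be summarized in a few lines. A simplified sketch, assuming image sub-region features and word embeddings have already been projected to a common dimension (not the official implementation):

import torch
import torch.nn.functional as F

def word_attention(image_feats, word_embs):
    """Each image sub-region attends over the word embeddings to build a
    word-context feature used to refine that part of the image."""
    # image_feats: (N, D) -- N spatial sub-regions at the current stage
    # word_embs:   (T, D) -- T word embeddings from the text encoder
    scores = image_feats @ word_embs.t()   # (N, T) region-word similarity
    attn = F.softmax(scores, dim=-1)       # attention over words, per region
    return attn @ word_embs                # (N, D) word-context features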
Image Synthesis From Reconfigurable Layout and Style
Wei Sun, Tianfu Wu • 2019 IEEE/CVF International Conference on Computer Vision (ICCV) • 2019
TLDR: Presents a layout- and style-based architecture for generative adversarial networks (termed LostGANs) that can be trained end-to-end to generate images from reconfigurable layout and style.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
TLDR: Introduces BERT, a language representation model that pre-trains deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, and that can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
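As a hedged usage example, a pretrained BERT could embed a region description via the Hugging Face transformers library (DT2I's exact text encoder is not assumed here):

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("a fluffy white cat on a red sofa", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
# Mean-pool token states into one vector for the whole region description.
region_emb = outputs.last_hidden_state.mean(dim=1)  # shape: (1, 768)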
Generative Adversarial Text to Image Synthesis
TLDR: Develops a novel deep architecture and GAN formulation that bridges advances in text and image modeling, translating visual concepts from characters to pixels.
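The text-conditional GAN objective from this line of work can be sketched compactly. A simplified version with hypothetical D and G networks that take a sentence embedding (the paper's matching-aware variant additionally feeds mismatched text-image pairs to the discriminator):

import torch
import torch.nn.functional as F

def text_cgan_losses(D, G, real_imgs, text_embs, z):
    """Simplified text-conditional GAN losses: both networks see the text."""
    fake_imgs = G(z, text_embs)
    real_logits = D(real_imgs, text_embs)
    fake_logits = D(fake_imgs.detach(), text_embs)
    # Discriminator: real (image, text) pairs vs. generated ones.
    d_loss = (F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits))
              + F.binary_cross_entropy_with_logits(fake_logits, torch.zeros_like(fake_logits)))
    # Generator: make text-conditioned fakes look real to D.
    gen_logits = D(fake_imgs, text_embs)
    g_loss = F.binary_cross_entropy_with_logits(gen_logits, torch.ones_like(gen_logits))
    return d_loss, g_loss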
Adam: A Method for Stochastic Optimization
TLDR: Introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions based on adaptive estimates of lower-order moments, with a regret bound on the convergence rate comparable to the best known results in the online convex optimization framework.
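For reference, the Adam update rule as given in that paper fits in a few lines of NumPy (default hyperparameters follow the paper):

import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update (t starts at 1): adaptive estimates of the first
    moment m and second raw moment v, with bias correction."""
    m = beta1 * m + (1 - beta1) * grad        # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2   # second-moment estimate
    m_hat = m / (1 - beta1 ** t)              # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)              # bias-corrected second moment
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v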
AttrLostGAN: Attribute Controlled Image Synthesis from Reconfigurable Layout and Style
TLDR: Extends a state-of-the-art layout-to-image approach to additionally condition individual objects on attributes, and shows that the method can control the fine-grained details of individual objects when modelling complex scenes with multiple objects.
Adversarial Text-to-Image Synthesis: A Review
Image Synthesis from Layout with Locality-Aware Mask Adaption
TLDR: Shows experimentally that the proposed model with LAMA outperforms existing approaches in visual fidelity and alignment with input layouts, improving the state-of-the-art FID from 41.65 to 31.12 and SceneFID from 22.00 to 18.64.
Learning to Compose Visual Relations
TLDR: Proposes to represent each relation as an unnormalized density (an energy-based model) so that separate relations can be composed in a factorized manner, and shows that this decomposition helps the model understand the underlying relational scene structure.
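The factorized composition idea translates directly into code: summing per-relation energies multiplies their unnormalized densities, and gradient-based MCMC samples from the product. A minimal sketch with hypothetical energy networks (not the authors' code):

import torch

def langevin_sample(energy_fns, conditions, shape, steps=60, step_size=0.01):
    """Sample from the product of per-relation densities by running Langevin
    dynamics on the summed energies."""
    x = torch.randn(shape, requires_grad=True)
    for _ in range(steps):
        # Composed energy: sum over relations = product of densities.
        energy = sum(E(x, c) for E, c in zip(energy_fns, conditions))
        grad, = torch.autograd.grad(energy.sum(), x)
        with torch.no_grad():
            # Gradient descent on the energy plus Gaussian noise.
            x = x - step_size * grad + (2 * step_size) ** 0.5 * torch.randn_like(x)
        x.requires_grad_(True)
    return x.detach()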