• Corpus ID: 235742820

Improving Text-to-Image Synthesis Using Contrastive Learning

@inproceedings{Ye2021ImprovingTS,
  title={Improving Text-to-Image Synthesis Using Contrastive Learning},
  author={Hui Ye and Xiulong Yang and Martin Tak{\'a}c and Rajshekhar Sunderraman and Shihao Ji},
  booktitle={BMVC},
  year={2021}
}
The goal of text-to-image synthesis is to generate a visually realistic image that matches a given text description. In practice, the captions annotated by humans for the same image vary widely in content and choice of words. The linguistic discrepancy between the captions of the same image leads to synthetic images that deviate from the ground truth. To address this issue, we propose a contrastive learning approach to improve the quality and enhance the semantic…
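The idea of pulling together the text features of different captions that describe the same image, while pushing apart captions of different images, can be sketched as an InfoNCE-style contrastive loss. The snippet below is an illustrative PyTorch sketch, not the authors' released code; the text encoder, temperature, and batch construction are assumptions.

import torch
import torch.nn.functional as F

def caption_contrastive_loss(z1, z2, temperature=0.1):
    # z1[i] and z2[i] are embeddings of two captions of the same image (B, D);
    # matching rows are positives, all other rows in the batch act as negatives.
    z1 = F.normalize(z1, dim=-1)
    z2 = F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature                    # (B, B) scaled cosine similarities
    labels = torch.arange(z1.size(0), device=z1.device)
    # symmetric cross-entropy over the two matching directions
    return 0.5 * (F.cross_entropy(logits, labels)
                  + F.cross_entropy(logits.t(), labels))

In use, z1 and z2 would come from encoding two randomly sampled captions of each training image with the same text encoder.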

Citations of this paper

Towards Language-Free Training for Text-to-Image Generation

TLDR
This paper presents the first work to train text-to-image generation models without any text data, leveraging the well-aligned multimodal semantic space of the powerful pre-trained CLIP model: the requirement of text conditioning is seamlessly alleviated by generating text features from image features.
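A rough sketch of the language-free idea, under the assumption that pseudo text features are obtained by perturbing normalized CLIP image embeddings; the paper's exact perturbation scheme (fixed vs. adaptive noise) is not reproduced here.

import torch
import torch.nn.functional as F

def pseudo_text_feature(clip_image_emb, noise_scale=0.1):
    # Perturb a normalized CLIP image embedding and renormalize, so it can
    # stand in for the CLIP text embedding of a caption that is unavailable.
    e = F.normalize(clip_image_emb, dim=-1)
    noise = torch.randn_like(e)
    e = e + noise_scale * noise / noise.norm(dim=-1, keepdim=True)
    return F.normalize(e, dim=-1)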

Make-A-Scene: Scene-Based Text-to-Image Generation with Human Priors

TLDR
A novel text-to-image method that addresses gaps by enabling a simple control mechanism complementary to text in the form of a scene, and introducing elements that substantially improve the tokenization process by employing domain-specific knowledge over key image regions (faces and salient objects).

C3: Contrastive Learning for Cross-domain Correspondence in Few-shot Image Generation

TLDR
This paper proposes a simple yet effective method, C3 (Contrastive Learning for Cross-domain Correspondence), which constructs positive and negative pairs of images from two different domains and makes the generative model learn the cross-domain correspondence explicitly via contrastive learning.

GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models

TLDR
This work explores diffusion models for the problem of text-conditional image synthesis and compares two different guidance strategies: CLIP guidance and classifier-free guidance, finding that the latter is preferred by human evaluators for both photorealism and caption similarity, and often produces photorealistic samples.
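Classifier-free guidance, the strategy GLIDE's human evaluators preferred, extrapolates the conditional noise prediction away from the unconditional one. A minimal sketch; the epsilon inputs stand for the diffusion model's two predictions and the names are illustrative.

def classifier_free_guidance(eps_uncond, eps_cond, guidance_scale):
    # eps_hat = eps_uncond + s * (eps_cond - eps_uncond);
    # guidance_scale > 1 trades sample diversity for fidelity to the prompt.
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)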

Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding

TLDR
This work presents Imagen, a text-to-image diffusion model with an unprecedented degree of photorealism and a deep level of language understanding, and finds that human raters prefer Imagen over other models in side-by-side comparisons, both in terms of sample quality and image-text alignment.

Discriminator Modification in GAN for Text-to-Image Generation

TLDR
This paper proposes a diversity-sensitive conditional discriminator (D-SCD) and a contrastive searching gradient penalty (CSGP) strategy to measure the realism of generated images and to penalize gradients for a more stable training process, and introduces a multi-level image similarity (MLIS) loss for the discriminator's feature extractor to further promote high-level feature similarity between real and generated images and objects.

A Review of Multi-Modal Learning from the Text-Guided Visual Processing Viewpoint

TLDR
This study follows up on previous surveys of T2I, evaluating the diverse range of existing methods, including different generative models and several types of visual output, critically examining various approaches, highlighting their shortcomings, and suggesting future directions of research.

Multimodal Conditional Image Synthesis with Product-of-Experts GANs

TLDR
The Product-of-Experts Generative Adversarial Networks (PoE-GAN) framework is proposed, which can synthesize images conditioned on multiple input modalities or any subset of them, even the empty set, to advance the state of the art in multimodal conditional image synthesis.
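Conditioning on any subset of modalities, including the empty set, is the hallmark of a product-of-experts formulation. The sketch below shows the standard closed form for fusing diagonal Gaussian experts with a unit-Gaussian prior expert; whether PoE-GAN uses exactly this parameterization is an assumption here.

import torch

def product_of_gaussian_experts(mus, logvars):
    # Precision-weighted fusion of diagonal Gaussian experts, plus an implicit
    # N(0, I) prior expert so the product is defined even when every modality
    # is missing (the "empty set" case).
    mus = list(mus) + [torch.zeros_like(mus[0])]
    precisions = [torch.exp(-lv) for lv in logvars] + [torch.ones_like(mus[0])]
    prec_sum = sum(precisions)
    mu = sum(m * p for m, p in zip(mus, precisions)) / prec_sum
    return mu, torch.log(1.0 / prec_sum)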

TISE: A Toolbox for Text-to-Image Synthesis Evaluation

TLDR
A combined set of existing and new metrics is proposed to systematically evaluate state-of-the-art methods for single- and multi-object text-to-image synthesis, and a strong baseline model is created for the benchmark, resulting in a highly consistent ranking among existing methods that is well aligned with human evaluation.

CSCI 1430 Final Project Report: Text-to-Image Synthesis By Separating “Verbal” From “Nonverbal” Information Using Residual Auto-Encoder

TLDR
This paper illustrates an approach to text-to-image synthesis that uses significantly fewer resources and is able to generate decent images from the authors' custom Shapes data.

References

SHOWING 1-10 OF 48 REFERENCES

RiFeGAN: Rich Feature Generation for Text-to-Image Synthesis From Prior Knowledge

TLDR
RiFeGAN, a novel rich-feature-generating approach to text-to-image synthesis, enriches the given description from prior knowledge and exploits multi-caption attentional generative adversarial networks to synthesize images from those features.

Semantics Disentangling for Text-To-Image Generation

TLDR
A novel photo-realistic text-to-image generation model that implicitly disentangles semantics to fulfill both high-level semantic consistency and low-level semantic diversity, together with a visual-semantic embedding strategy based on semantic-conditioned batch normalization to find diverse low-level semantics.
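Semantic-conditioned batch normalization can be read as a conditional batch norm whose affine parameters are predicted from the sentence embedding. A generic sketch; the (1 + gamma) parameterization and layer sizes are illustrative rather than taken from the paper.

import torch
import torch.nn as nn

class SemanticConditionedBatchNorm(nn.Module):
    # Conditional batch norm (sketch): scale and shift come from the sentence
    # embedding instead of being free per-channel parameters.
    def __init__(self, num_features, sent_dim):
        super().__init__()
        self.bn = nn.BatchNorm2d(num_features, affine=False)
        self.gamma = nn.Linear(sent_dim, num_features)
        self.beta = nn.Linear(sent_dim, num_features)

    def forward(self, x, sent_emb):
        g = self.gamma(sent_emb).unsqueeze(-1).unsqueeze(-1)
        b = self.beta(sent_emb).unsqueeze(-1).unsqueeze(-1)
        return (1 + g) * self.bn(x) + b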

Cross-Modal Contrastive Learning for Text-to-Image Generation

TLDR
The Cross-Modal Contrastive Generative Adversarial Network (XMC-GAN) addresses the challenge of text-to-image synthesis systems by maximizing the mutual information between image and text via multiple contrastive losses which capture inter-modality and intra-modality correspondences.
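The inter-modality term mirrors the caption-level loss sketched earlier, but pairs each image with its sentence. A hedged sketch; XMC-GAN also uses region-word and real-fake image losses that are omitted here.

import torch
import torch.nn.functional as F

def image_text_contrastive_loss(img_emb, txt_emb, temperature=0.1):
    # Each image is matched against all sentences in the batch and vice versa;
    # matching (image, sentence) pairs are positives.
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature
    labels = torch.arange(img_emb.size(0), device=img_emb.device)
    return 0.5 * (F.cross_entropy(logits, labels)
                  + F.cross_entropy(logits.t(), labels))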

CPGAN: Content-Parsing Generative Adversarial Networks for Text-to-Image Synthesis

TLDR
This paper designs a memory structure to parse the textual content by exploring the semantic correspondence between each word in the vocabulary and its various visual contexts across relevant images during text encoding, so as to model text-to-image consistency at the semantic level.

MirrorGAN: Learning Text-To-Image Generation by Redescription

TLDR
Thorough experiments on two public benchmark datasets demonstrate the superiority of MirrorGAN over other representative state-of-the-art methods.

Adversarial Learning of Semantic Relevance in Text to Image Synthesis

TLDR
A new approach that improves the training of generative adversarial nets (GANs) for synthesizing diverse images from a text input is described; it is based on the conditional version of GANs and expands on previous work by leveraging an auxiliary task in the discriminator.

Controllable Text-to-Image Generation

TLDR
A novel controllable text-to-image generative adversarial network (ControlGAN) is proposed, which can effectively synthesise high-quality images and also control parts of the image generation according to natural language descriptions.

StackGAN: Text to Photo-Realistic Image Synthesis with Stacked Generative Adversarial Networks

TLDR
This paper proposes Stacked Generative Adversarial Networks (StackGAN) to generate 256×256 photo-realistic images conditioned on text descriptions and introduces a novel Conditioning Augmentation technique that encourages smoothness in the latent conditioning manifold.
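Conditioning Augmentation resamples the conditioning vector from a Gaussian whose mean and diagonal covariance are predicted from the sentence embedding, with a KL penalty toward N(0, I). A minimal sketch; the layer dimensions are illustrative.

import torch
import torch.nn as nn

class ConditioningAugmentation(nn.Module):
    # Map a sentence embedding to N(mu, sigma^2) and sample with the
    # reparameterization trick; the KL term smooths the conditioning manifold.
    def __init__(self, emb_dim=1024, cond_dim=128):
        super().__init__()
        self.fc = nn.Linear(emb_dim, cond_dim * 2)

    def forward(self, sent_emb):
        mu, logvar = self.fc(sent_emb).chunk(2, dim=-1)
        c = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return c, kl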

DM-GAN: Dynamic Memory Generative Adversarial Networks for Text-To-Image Synthesis

  • Minfeng Zhu, P. Pan, Wei Chen, Yi Yang
  • Computer Science
    2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
  • 2019
TLDR
The proposed DM-GAN model introduces a dynamic memory module to refine fuzzy image contents when the initial images are not well generated, and performs favorably against state-of-the-art approaches.
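The reading step of such a key-value memory can be sketched as attention from image region features over word-derived memory slots; DM-GAN's memory writing and response gating are omitted, and the shapes below are assumptions.

import torch

def memory_read(region_feats, memory_keys, memory_values):
    # region_feats: (B, N, D) image region features
    # memory_keys / memory_values: (B, T, D) slots derived from word features
    attn = torch.softmax(region_feats @ memory_keys.transpose(1, 2), dim=-1)  # (B, N, T)
    return attn @ memory_values                                               # (B, N, D) refined responses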

DF-GAN: Deep Fusion Generative Adversarial Networks for Text-to-Image Synthesis

TLDR
This work proposes a novel simplified text-to-image backbone that synthesizes high-quality images directly with a single pair of generator and discriminator, a novel regularization method called Matching-Aware zero-centered Gradient Penalty, and a novel fusion module that effectively exploits the semantics of text descriptions and deeply fuses text and image features during the generation process.
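A matching-aware zero-centered gradient penalty is typically applied only at real images paired with their matching text. A hedged sketch; the exponent p and weight k are common settings rather than a verified reproduction of DF-GAN's hyperparameters.

import torch

def matching_aware_gradient_penalty(discriminator, real_imgs, sent_embs, p=6, k=2.0):
    # Zero-centered penalty on the discriminator's gradients w.r.t. both the
    # real image and its matching sentence embedding.
    real_imgs = real_imgs.detach().requires_grad_(True)
    sent_embs = sent_embs.detach().requires_grad_(True)
    out = discriminator(real_imgs, sent_embs)
    grad_img, grad_sent = torch.autograd.grad(
        outputs=out.sum(), inputs=(real_imgs, sent_embs), create_graph=True)
    grad_norm = torch.sqrt(grad_img.flatten(1).pow(2).sum(1)
                           + grad_sent.flatten(1).pow(2).sum(1))
    return k * grad_norm.pow(p).mean()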