Large-scale Text-to-Image Generation Models for Visual Artists' Creative Works

Hyung-Kwon Ko, Gwanmo Park, Hyeon Jeon, Jaemin Jo, Juho Kim, Jinwook Seo
Large-scale Text-to-image Generation Models (LTGMs) (e.g., DALL-E), self-supervised deep learning models trained on huge datasets, have demonstrated the capacity to generate high-quality, open-domain images from multi-modal input. Although they can even produce anthropomorphized versions of objects and animals, combine irrelevant concepts in reasonable ways, and give variation to any user-provided images, we witnessed that such rapid technological advancement has left many visual artists disoriented…

LAION-5B: An open large-scale dataset for training next generation image-text models

This work presents LAION-5B, a dataset of 5.85 billion CLIP-filtered image-text pairs, of which 2.32B contain English text; it demonstrates successful replication and fine-tuning of foundation models such as CLIP, GLIDE, and Stable Diffusion using the dataset, and discusses further experiments enabled by an openly available dataset of this scale.

Design Guidelines for Prompt Engineering Text-to-Image Generative Models

A study exploring which prompt keywords and model hyperparameters help produce coherent outputs from text-to-image generative models; the prompts are structured around subject and style keywords, and the study investigates their success and failure modes.

Initial Images: Using Image Prompts to Improve Subject Representation in Multimodal AI Generated Art

Advances in text-to-image generative models have made it easier for people to create art by simply prompting models with text. However, creating through text leaves users with limited control over the…

An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion

This work uses only 3–5 images of a user-provided concept to represent it through new “words” in the embedding space of a frozen text-to-image model; these can be composed into natural-language sentences, guiding personalized creation in an intuitive way.

Opal: Multimodal Image Generation for News Illustration

How structured exploration can help users better understand the capabilities of human-AI co-creative systems is discussed, and Opal, a system that produces text-to-image generations for news illustration, is presented.

Scaling Autoregressive Models for Content-Rich Text-to-Image Generation

The Pathways Autoregressive Text-to-Image (Parti) model is presented, which generates high-fidelity photorealistic images and supports content-rich synthesis involving complex compositions and world knowledge; the work also explores and highlights the model's limitations.

Make-A-Scene: Scene-Based Text-to-Image Generation with Human Priors

A novel text-to-image method that enables a simple control mechanism complementary to text in the form of a scene, and introduces elements that substantially improve the tokenization process by applying domain-specific knowledge to key image regions (faces and salient objects).

DALL-Eval: Probing the Reasoning Skills and Social Biases of Text-to-Image Generative Transformers

It is shown that recent text-to-image generative transformer models perform better at recognizing and counting objects than at recognizing colors and understanding spatial relations, while a large gap remains between model performance and upper-bound accuracy on all skills.

NÜWA: Visual Synthesis Pre-training for Neural visUal World creAtion

Compared to several strong baselines, NÜWA achieves state-of-the-art results on text-to-image generation, text-to-video generation, video prediction, etc., and shows surprisingly good zero-shot capabilities on text and video manipulation tasks.

TaleBrush: Sketching Stories with Generative Pretrained Language Models

TaleBrush is introduced, a generative story ideation tool that uses line-sketching interactions with a GPT-based language model for control and sensemaking of a protagonist’s fortune in co-created stories; the work also reflects on how sketching interactions can facilitate the iterative human-AI co-creation process.