Scaling Autoregressive Models for Content-Rich Text-to-Image Generation

Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, Benton C. Hutchinson, Wei Han, Zarana Parekh, Xin Li, Han Zhang, Jason Baldridge, Yonghui Wu
We present the Pathways [1] Autoregressive Text-to-Image (Parti) model, which generates high-fidelity photorealistic images and supports content-rich synthesis involving complex compositions and world knowledge. Parti treats text-to-image generation as a sequence-to-sequence modeling problem, akin to machine translation, with sequences of image tokens as the target outputs rather than text tokens in another language. This strategy can naturally tap into the rich body of prior work on large… 

Multimodal Image Synthesis and Editing: A Survey

This survey comprehensively contextualizes recent advances in multimodal image synthesis and editing and formulates taxonomies according to data modality and model architecture.

An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion

This work uses only 3-5 images of a user-provided concept to represent it through new “words” in the embedding space of a frozen text-to-image model, which can be composed into natural language sentences, guiding personalized creation in an intuitive way.

NUWA-Infinity: Autoregressive over Autoregressive Generation for Infinite Visual Synthesis

NUWA-Infinity, a generative model for infinite visual synthesis, defined as the task of generating arbitrarily-sized high-resolution images or long-duration videos, shows superior visual synthesis capabilities in terms of resolution and variable-size generation.

Extremely Simple Activation Shaping for Out-of-Distribution Detection

The separation between training and deployment of machine learning models implies that not all scenarios encountered in deployment can be anticipated during training, and therefore relying solely on

The Biased Artist: Exploiting Cultural Biases via Homoglyphs in Text-Guided Image Generation Models

Text-guided image generation models, such as DALL-E 2 and Stable Diffusion, have recently received much attention from academia and the general public. Provided with textual descriptions, these

CLIPping Privacy: Identity Inference Attacks on Multi-Modal Machine Learning Models

A new type of privacy attack, named identity inference attack (IDIA), designed for multi-modal image-text models like CLIP, is introduced, which shows that the attacker can identify individuals used for training with very high accuracy and that the model learns to connect the names with the depicted people.

AudioLM: a Language Modeling Approach to Audio Generation

The proposed hybrid tokenization scheme leverages the discretized activations of a masked language model pre-trained on audio to capture long-term structure and the discrete codes produced by a neural audio codec to achieve high-quality synthesis.

DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation

A new approach for “personalization” of text-to-image diffusion models (specializing them to users’ needs) that leverages the semantic prior embedded in the model together with a new autogenous class-specific prior preservation loss, enabling synthesis of the subject in diverse scenes, poses, views, and lighting conditions that do not appear in the reference images.

Text to Image Generation: Leaving no Language Behind

An initial exploration of how the performance of three popular text-to-image generators depends on the language shows that there is a performance degradation when using languages other than English, especially for languages that are not widely used.

Finding Reusable Machine Learning Components to Build Programming Language Processing Pipelines

To improve the findability, accessibility, interoperability and reusability (FAIRness) of machine learning components, a set of representative papers in the domain of machine learning-based PLP is collected and analyzed.



Make-A-Scene: Scene-Based Text-to-Image Generation with Human Priors

A novel text-to-image method that addresses gaps by enabling a simple control mechanism complementary to text in the form of a scene, and introducing elements that substantially improve the tokenization process by employing domain-specific knowledge over key image regions (faces and salient objects).

Taming Transformers for High-Resolution Image Synthesis

It is demonstrated how combining the effectiveness of the inductive bias of CNNs with the expressivity of transformers enables them to model and thereby synthesize high-resolution images.

Connecting Vision and Language with Localized Narratives

An extensive analysis of Localized Narratives is provided, showing that they are diverse, accurate, and efficient to produce, and their utility for controlled image captioning is demonstrated.

GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium

This work proposes a two time-scale update rule (TTUR) for training GANs with stochastic gradient descent on arbitrary GAN loss functions and introduces the "Frechet Inception Distance" (FID) which captures the similarity of generated images to real ones better than the Inception Score.

Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding

This work presents Imagen, a text-to-image diffusion model with an unprecedented degree of photorealism and a deep level of language understanding, and finds that human raters prefer Imagen over other models in side-by-side comparisons, both in terms of sample quality and image-text alignment.

CoCa: Contrastive Captioners are Image-Text Foundation Models

A minimalist design to pretrain an image-text encoder-decoder foundation model jointly with contrastive loss and captioning loss, thereby subsuming model capabilities from contrastive approaches like CLIP and generative methods like SimVLM.

Cross-Modal Contrastive Learning for Text-to-Image Generation

The Cross-Modal Contrastive Generative Adversarial Network (XMC-GAN) addresses the challenge of text-to-image synthesis systems by maximizing the mutual information between image and text via multiple contrastive losses which capture inter-modality and intra-modality correspondences.

Zero-Shot Text-to-Image Generation

This work describes a simple approach based on a transformer that autoregressively models the text and image tokens as a single stream of data; the approach is competitive with previous domain-specific models when evaluated in a zero-shot fashion.

Adafactor: Adaptive Learning Rates with Sublinear Memory Cost

This work demonstrates empirically that adaptive methods can produce larger-than-desired updates when the decay rate of the second moment accumulator is too slow, and proposes update clipping and a gradually increasing decay rate scheme as remedies.

Attention is All you Need

A new simple network architecture, the Transformer, based solely on attention mechanisms and dispensing with recurrence and convolutions entirely, is proposed; it generalizes well to other tasks, as shown by applying it successfully to English constituency parsing with both large and limited training data.