CoCa: Contrastive Captioners are Image-Text Foundation Models

@article{Yu2022CoCaCC,
  title={CoCa: Contrastive Captioners are Image-Text Foundation Models},
  author={Jiahui Yu and Zirui Wang and Vijay Vasudevan and Legg Yeung and Mojtaba Seyedhosseini and Yonghui Wu},
  journal={ArXiv},
  year={2022},
  volume={abs/2205.01917}
}
Exploring large-scale pretrained foundation models is of significant interest in computer vision because these models can be quickly transferred to many downstream tasks. This paper presents Contrastive Captioner (CoCa), a minimalist design to pretrain an image-text encoder-decoder foundation model jointly with contrastive loss and captioning loss, thereby subsuming model capabilities from contrastive approaches like CLIP and generative methods like SimVLM. In contrast to standard encoder…
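
As a rough illustration of the joint objective described in the abstract, the sketch below combines a CLIP-style contrastive loss with an autoregressive captioning loss. The encoder/decoder internals are omitted, and the temperature, loss weights, and tensor names are illustrative assumptions, not the authors' implementation.

# Minimal sketch of a joint contrastive + captioning objective (assumptions noted above).
import torch
import torch.nn.functional as F

def coca_style_loss(image_emb, text_emb, caption_logits, caption_targets,
                    temperature=0.07, contrastive_weight=1.0, caption_weight=2.0,
                    pad_id=0):
    """image_emb, text_emb: (B, D) pooled embeddings from the image encoder and
    the unimodal text decoder; caption_logits: (B, T, V) outputs of the
    multimodal text decoder; caption_targets: (B, T) next-token ids."""
    # Symmetric InfoNCE contrastive loss matching images to their paired texts.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    sims = image_emb @ text_emb.t() / temperature            # (B, B) similarities
    labels = torch.arange(sims.size(0), device=sims.device)  # diagonal = positives
    contrastive = 0.5 * (F.cross_entropy(sims, labels) +
                         F.cross_entropy(sims.t(), labels))

    # Autoregressive captioning loss: next-token cross-entropy, ignoring padding.
    captioning = F.cross_entropy(caption_logits.flatten(0, 1),
                                 caption_targets.flatten(),
                                 ignore_index=pad_id)

    return contrastive_weight * contrastive + caption_weight * captioning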
Scaling Autoregressive Models for Content-Rich Text-to-Image Generation
TLDR
The Pathways Autoregressive Text-to-Image (Parti) model is presented, which generates high-fidelity photorealistic images and supports content-rich synthesis involving complex compositions and world knowledge; the paper also explores and highlights limitations of the model.
GIT: A Generative Image-to-text Transformer for Vision and Language
TLDR
This paper designs and trains a GIT to unify vision-language tasks such as image/video captioning and question answering, and presents a new scheme of generation-based image classification and scene text recognition, achieving decent performance on standard benchmarks.
Prefix Conditioning Unifies Language and Label Supervision
TLDR
This work proposes a simple yet effective approach to unify these two types of supervision using prefix tokens that inform a language encoder of the type of the input sentence (e.g., caption or prompt) at training time.
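
A hedged sketch of the prefix-conditioning idea follows: a special token prepended to each sentence tells the shared text encoder whether the input is a free-form caption or a prompt built from a class label. The token strings and the helper function are illustrative placeholders, not the paper's actual vocabulary.

# Prefix tokens marking the supervision source of each text input (illustrative names).
CAPTION_PREFIX = "[CAP]"      # free-form alt-text captions
PROMPT_PREFIX = "[CLS2TXT]"   # prompts constructed from class labels

def add_prefix(text: str, source: str) -> str:
    """Prepend the prefix token matching the supervision source."""
    prefix = CAPTION_PREFIX if source == "caption" else PROMPT_PREFIX
    return f"{prefix} {text}"

# Both supervision types then flow through the same language encoder at training time:
print(add_prefix("two dogs playing on the beach at sunset", "caption"))
print(add_prefix("a photo of a golden retriever", "label"))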
Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding
TLDR
This work presents Imagen, a text-to-image diffusion model with an unprecedented degree of photorealism and a deep level of language understanding, and finds that human raters prefer Imagen over other models in side-by-side comparisons, both in terms of sample quality and image-text alignment.
REVECA - Rich Encoder-decoder framework for Video Event CAptioner
TLDR
A Rich Encoder–decoder framework for Video Event CAptioner (REVECA) that utilizes spatial and temporal information from the video to generate a caption for the corresponding event boundary.
VL-BEiT: Generative Vision-Language Pretraining
TLDR
A vision-language foundation model called VL-BEiT, a bidirectional multimodal Transformer learned by generative pretraining, is introduced; it effectively leverages monomodal data such as images and texts as well as multimodal data such as image-text pairs.
Multimodal Contrastive Learning with LIMoE: the Language-Image Mixture of Experts
TLDR
This work presents the Language-Image MoE (LIMoE), a sparse mixture-of-experts model capable of multimodal learning, and proposes an entropy-based regularization scheme; LIMoE demonstrates remarkable performance improvements over dense models of equivalent computational cost.
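
To make the entropy-based idea concrete, here is a generic sketch of entropy terms computed on a mixture-of-experts router: per-token entropy is minimized to encourage confident routing, while the entropy of the batch-averaged routing distribution is maximized to encourage balanced expert use. This is a simplified illustration in that spirit, not LIMoE's exact auxiliary losses.

# Generic entropy-based regularizer for a sparse MoE router (illustrative, not LIMoE's exact scheme).
import torch

def routing_entropy_losses(router_logits):
    """router_logits: (num_tokens, num_experts) pre-softmax gating scores."""
    probs = torch.softmax(router_logits, dim=-1)                       # (N, E)
    # Local term: mean per-token entropy; minimizing it pushes each token
    # toward a confident expert choice.
    local = -(probs * probs.clamp_min(1e-9).log()).sum(-1).mean()
    # Global term: entropy of the batch-averaged routing distribution;
    # maximizing it encourages balanced usage across experts.
    mean_probs = probs.mean(0)
    global_ = -(mean_probs * mean_probs.clamp_min(1e-9).log()).sum()
    return local, -global_   # add both returned terms to the training loss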
Multimodal Masked Autoencoders Learn Transferable Representations
TLDR
This paper proposes a simple and scalable network architecture, the Multimodal Masked Autoencoder (M3AE), which learns a unified encoder for both vision and language data via masked token prediction, and demonstrates the scalability of M3AE with larger model size and training time, and its ability to learn generalizable representations that transfer well to downstream tasks.
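
The sketch below illustrates the masked-token-prediction setup described above: image patches and text tokens are concatenated into one sequence, a random subset is replaced by a mask token, and a single Transformer encoder reconstructs the masked positions. Mask ratio, dimensions, and module names are assumptions for illustration (positional and modality embeddings are omitted for brevity).

# Toy multimodal masked autoencoder: one encoder over concatenated patch and text tokens.
import torch
import torch.nn as nn

class TinyMultimodalMAE(nn.Module):
    def __init__(self, vocab_size=1000, patch_dim=768, d_model=256, mask_ratio=0.75):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.patch_proj = nn.Linear(patch_dim, d_model)    # embed image patches
        self.text_embed = nn.Embedding(vocab_size, d_model)
        self.mask_token = nn.Parameter(torch.zeros(d_model))
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
            num_layers=2)
        self.patch_head = nn.Linear(d_model, patch_dim)    # reconstruct masked patches
        self.text_head = nn.Linear(d_model, vocab_size)    # predict masked text tokens

    def forward(self, patches, text_ids):
        # Build one multimodal sequence: projected patches followed by text embeddings.
        x = torch.cat([self.patch_proj(patches), self.text_embed(text_ids)], dim=1)
        # Randomly mask a subset of positions across both modalities.
        mask = torch.rand(x.shape[:2], device=x.device) < self.mask_ratio
        x = torch.where(mask.unsqueeze(-1), self.mask_token, x)
        h = self.encoder(x)
        n_patches = patches.size(1)
        # Separate heads decode the two modalities; losses apply at masked positions only.
        return self.patch_head(h[:, :n_patches]), self.text_head(h[:, n_patches:]), mask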
Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone
TLDR
This work presents FIBER (Fusion-In-the-Backbone-based transformER), a new VL model architecture that can seamlessly handle both coarse-grained and fine-grained tasks and provides consistent performance improvements over strong baselines across all tasks, often outperforming methods trained on orders of magnitude more data.
Visual Clues: Bridging Vision and Language Foundations for Image Paragraph Captioning
TLDR
This work argues that by using visual clues to bridge large pretrained vision foundation models and language models, image paragraph captioning can be performed without any extra cross-modal training.