CoCa: Contrastive Captioners are Image-Text Foundation Models

Jiahui Yu, Zirui Wang, Vijay Vasudevan, Legg Yeung, Mojtaba Seyedhosseini, Yonghui Wu
Exploring large-scale pretrained foundation models is of significant interest in computer vision because these models can be quickly transferred to many downstream tasks. This paper presents Contrastive Captioner (CoCa), a minimalist design to pretrain an image-text encoder-decoder foundation model jointly with contrastive loss and captioning loss, thereby subsuming model capabilities from contrastive approaches like CLIP and generative methods like SimVLM. In contrast to standard encoder…
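The joint objective summarized above (a CLIP-style contrastive loss plus an autoregressive captioning loss) can be sketched as follows. This is a minimal illustration, not the authors' code: the shapes, loss weights, and temperature are assumptions.

```python
# Minimal sketch of a CoCa-style joint objective: contrastive + captioning.
# All names, shapes, and weights are illustrative assumptions.
import numpy as np

def _log_softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=axis, keepdims=True))

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired image/text embeddings (B, D)."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature          # (B, B); matching pairs on diagonal
    diag = np.arange(len(logits))
    i2t = -_log_softmax(logits)[diag, diag].mean()    # image -> text direction
    t2i = -_log_softmax(logits.T)[diag, diag].mean()  # text -> image direction
    return (i2t + t2i) / 2

def captioning_loss(token_logits, target_ids):
    """Teacher-forced cross-entropy: logits (B, T, V) against target ids (B, T)."""
    log_probs = _log_softmax(token_logits)
    B, T = target_ids.shape
    picked = log_probs[np.arange(B)[:, None], np.arange(T)[None, :], target_ids]
    return -picked.mean()

def coca_loss(img_emb, txt_emb, token_logits, target_ids,
              lambda_con=1.0, lambda_cap=2.0):  # weights are assumptions
    return (lambda_con * contrastive_loss(img_emb, txt_emb)
            + lambda_cap * captioning_loss(token_logits, target_ids))
```

In the actual model the two losses are computed from shared encoder/decoder features; here they simply take precomputed embeddings and logits to keep the sketch self-contained.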
Scaling Autoregressive Models for Content-Rich Text-to-Image Generation
The Pathways Autoregressive Text-to-Image (Parti) model is presented, which generates high-fidelity photorealistic images, supports content-rich synthesis involving complex compositions and world knowledge, and explores and highlights limitations of such models.
GIT: A Generative Image-to-text Transformer for Vision and Language
This paper designs and trains GIT to unify vision-language tasks such as image/video captioning and question answering, and presents a new generation-based scheme for image classification and scene text recognition, achieving decent performance on standard benchmarks.
Prefix Conditioning Unifies Language and Label Supervision
This work proposes a simple yet effective approach to unify these two types of supervision using prefix tokens that inform a language encoder of the type of the input sentence (e.g., caption or prompt) at training time.
Transferring Textual Knowledge for Visual Recognition
The role of the linear classifier is revisited: it is replaced with the embedded language representations of the object categories, and the paradigm achieves state-of-the-art accuracy of 87.3% on Kinetics-400.
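Replacing a learned linear classifier with language embeddings of the category names amounts to a CLIP-style similarity head, which can be sketched as below; the function name and shapes are illustrative assumptions, not the paper's implementation.

```python
# Hedged sketch: classify by cosine similarity between a visual embedding and
# text embeddings of the category names, instead of a learned linear layer.
import numpy as np

def text_embedding_classifier(feature, class_text_embs):
    """feature: (D,) visual embedding; class_text_embs: (C, D) language
    embeddings of the category names. Returns the index of the best class."""
    f = feature / np.linalg.norm(feature)
    t = class_text_embs / np.linalg.norm(class_text_embs, axis=1, keepdims=True)
    return int(np.argmax(t @ f))  # cosine similarity replaces W @ f + b
```

Because the "classifier weights" are just text embeddings, the same head extends to categories unseen at training time by embedding their names.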
Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding
This work presents Imagen, a text-to-image diffusion model with an unprecedented degree of photorealism and a deep level of language understanding, and finds that human raters prefer Imagen over other models in side-by-side comparisons, both in terms of sample quality and image-text alignment.
REVECA - Rich Encoder-decoder framework for Video Event CAptioner
A Rich Encoder-decoder framework for Video Event CAptioner (REVECA) that utilizes spatial and temporal information from the video to generate a caption for the corresponding event boundary.
VL-BEiT: Generative Vision-Language Pretraining
A vision-language foundation model called VL-BEiT, a bidirectional multimodal Transformer learned by generative pretraining, is introduced; it effectively leverages monomodal data such as images and texts as well as multimodal data such as image-text pairs.
Multimodal Open-Vocabulary Video Classification via Pre-Trained Vision and Language Models
In MOV, the vision encoder from pre-trained VLMs is used directly, with minimal modifications, to encode video, optical flow and audio spectrograms, and a cross-modal fusion mechanism is designed to aggregate complementary multimodal information.
Multimodal Contrastive Learning with LIMoE: the Language-Image Mixture of Experts
This work presents the Language-Image MoE, LIMoE, a sparse mixture-of-experts model capable of multimodal learning, and proposes an entropy-based regularization scheme that demonstrates a remarkable performance improvement over dense models of equivalent computational cost.
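An entropy-based router regularizer in this spirit can be sketched as follows: keep each token's routing confident (low per-token entropy) while keeping the batch-averaged routing spread across experts (high aggregate entropy). The threshold, names, and exact form below are illustrative assumptions, not the paper's formulation.

```python
# Hedged sketch of entropy-based regularization for a sparse MoE router.
import numpy as np

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def entropy(p, axis=-1):
    return -(p * np.log(p + 1e-9)).sum(axis=axis)

def router_entropy_losses(router_logits, tau=0.5):
    """router_logits: (tokens, experts). Returns (local, global) penalties.
    Local: mean per-token routing entropy, to be minimized (confident routing).
    Global: penalize the batch-averaged routing only when its entropy drops
    below a fraction tau of the maximum log(num_experts) (expert collapse)."""
    probs = softmax(router_logits)
    local = entropy(probs).mean()
    mean_route = probs.mean(axis=0)          # batch-averaged expert distribution
    global_pen = max(0.0, tau * np.log(probs.shape[1]) - entropy(mean_route))
    return local, global_pen
```

Minimizing the local term alone can collapse all tokens onto one expert; the global term pushes back by penalizing a peaked batch-level distribution.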
Multimodal Masked Autoencoders Learn Transferable Representations
This paper proposes a simple and scalable network architecture, the Multimodal Masked Autoencoder (M3AE), which learns a unified encoder for both vision and language data via masked token prediction, and demonstrates the scalability of M3AE with larger model size and training time, and its ability to learn generalizable representations that transfer well to downstream tasks.
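The masking step of such a unified encoder can be sketched as follows: image-patch tokens and text tokens are concatenated into one sequence, a random subset is masked, and the model is trained to predict the masked positions. The masking ratio, shapes, and function name are illustrative assumptions.

```python
# Hedged sketch of M3AE-style joint masking over image-patch and text tokens.
import numpy as np

def mask_multimodal_sequence(patch_tokens, text_tokens, mask_ratio=0.75, seed=0):
    """patch_tokens: (P, D); text_tokens: (T, D). Concatenates both modalities
    into one sequence, then returns (visible_tokens, masked_indices); the
    encoder sees only the visible tokens, and a decoder would be trained to
    reconstruct the tokens at masked_indices."""
    seq = np.concatenate([patch_tokens, text_tokens], axis=0)  # (P + T, D)
    rng = np.random.default_rng(seed)
    n_mask = int(mask_ratio * len(seq))
    perm = rng.permutation(len(seq))
    masked_idx = np.sort(perm[:n_mask])
    visible_idx = np.sort(perm[n_mask:])
    return seq[visible_idx], masked_idx
```

Masking across the joint sequence, rather than per modality, is what lets a single encoder learn from images, text, or paired data with one objective.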