Reproducible scaling laws for contrastive language-image learning

  • Mehdi Cherti, Romain Beaumont, Ross Wightman, Mitchell Wortsman, Gabriel Ilharco, Cade Gordon, Christoph Schuhmann, Ludwig Schmidt, Jenia Jitsev

Scaling up neural networks has led to remarkable performance across a wide range of tasks. Moreover, performance often follows reliable scaling laws as a function of training set size, model size, and compute, which offers valuable guidance as large-scale experiments become increasingly expensive. However, previous work on scaling laws has primarily used private data and models, or focused on uni-modal language or vision learning. To address these limitations, we investigate scaling laws for…

Evaluating Self-Supervised Learning via Risk Decomposition

An SSL risk decomposition is proposed, which generalizes the classical supervised approximation-estimation decomposition by considering errors arising from the representation learning step and gives valuable insights for designing and using SSL models.

A Simple Zero-shot Prompt Weighting Technique to Improve Prompt Ensembling in Text-Image Models

Using the proposed scoring method to create a weighted average prompt ensemble, the method outperforms equal average ensemble, as well as hand-crafted prompts, on ImageNet, 4 of its variants, and 11 fine-grained classification benchmarks, all while being fully automatic, optimization-free, and not requiring access to labeled validation data.
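The weighted ensemble described above can be sketched as follows. This is a minimal illustration, not the paper's actual scoring method: the function names and the assumption that prompt scores are given externally are hypothetical, and embeddings are assumed to be unit-normalized CLIP-style text/image features.

```python
import numpy as np

def weighted_prompt_ensemble(text_embs, scores):
    """Combine per-prompt class embeddings into one zero-shot classifier.

    text_embs: (n_prompts, n_classes, dim) unit-normalized prompt embeddings.
    scores:    (n_prompts,) non-negative prompt quality scores (hypothetical;
               the paper derives these automatically without labeled data).
    Returns a (n_classes, dim) classifier, re-normalized per class.
    """
    w = np.asarray(scores, dtype=float)
    w = w / w.sum()                          # turn scores into ensemble weights
    clf = np.einsum("p,pcd->cd", w, text_embs)
    return clf / np.linalg.norm(clf, axis=-1, keepdims=True)

def zero_shot_predict(image_emb, classifier):
    # cosine similarity -> predicted class index (embeddings unit-norm)
    return int(np.argmax(classifier @ image_emb))
```

Setting all scores equal recovers the plain equal-average ensemble that the paper uses as its baseline.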

Does CLIP Know My Face?

This work introduces a novel method to assess privacy for multi-modal models, specifically vision-language models like CLIP, and suggests that IDIAs can be used to prove the unauthorized use of data for training and to enforce privacy laws.

Exploring the Representation Manifolds of Stable Diffusion Through the Lens of Intrinsic Dimension

This work finds that the choice of prompt has a substantial impact on the intrinsic dimension of representations at both layers of the model that are explored, but that the nature of this impact depends on the layer being considered.

Hard Prompts Made Easy: Gradient-Based Discrete Optimization for Prompt Tuning and Discovery

This work describes an approach to robustly optimize hard text prompts through efficient gradient-based optimization and shows that hard prompts can be automatically discovered that are effective in tuning LMs for classification.

Contrastive Language-Image Pretrained (CLIP) Models are Powerful Out-of-Distribution Detectors

This study examines several setups, based on the availability of labels or image captions and using different combinations of in- and out-distributions, and finds that contrastive language-image pretrained models achieve state-of-the-art unsupervised out-of-distribution performance using nearest neighbors feature similarity as the OOD detection score.
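The nearest-neighbor feature-similarity score mentioned above can be sketched in a few lines. This is a generic kNN-based OOD score under the assumption of unit-normalized features (e.g. CLIP image embeddings), not the study's exact experimental pipeline:

```python
import numpy as np

def knn_ood_score(feat, train_feats, k=5):
    """Negative mean cosine similarity to the k most similar
    in-distribution features: higher score => more likely OOD.

    feat:        (dim,) unit-normalized test feature.
    train_feats: (n, dim) unit-normalized in-distribution features.
    """
    sims = train_feats @ feat          # cosine similarity to each train feature
    topk = np.sort(sims)[-k:]          # k nearest neighbors by similarity
    return -float(topk.mean())
```

Thresholding this score then separates in-distribution from out-of-distribution inputs without any labels.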

Scaling Laws for Generative Mixed-Modal Language Models

New mixed-modal scaling laws are reported that unify the contributions of individual modalities and the interactions between them, and the optimal synergy and competition due to data and model size is explicitly modeled as an additive term to previous uni-modal scaling laws.

Is BERT Blind? Exploring the Effect of Vision-and-Language Pretraining on Visual Language Understanding

  • Morris Alper, Michael Fiman, Hadar Averbuch-Elor
  • Computer Science
  • 2023
It is concluded that exposure to images during pretraining affords inherent visual reasoning knowledge that is reflected in language-only tasks that require implicit visual reasoning, providing principled guidelines for the choice of text encoders used in such contexts.

CAT-Seg: Cost Aggregation for Open-Vocabulary Semantic Segmentation

  • Seokju Cho, Heeseong Shin, Seung Wook Kim
  • Computer Science
  • 2023
This work proposes an alternative approach to optimize the image-text similarity map, i.e. the cost map, using a novel cost aggregation-based method, and proposes a framework, namely CAT-Seg, which achieves state-of-the-art performance across all benchmarks.

EVA-02: A Visual Representation for Neon Genesis

  • Yuxin Fang, Quan Sun, Xinggang Wang, Tiejun Huang, Xinlong Wang, Yue Cao
  • Computer Science
  • 2023
EVA-02, a next-generation Transformer-based visual representation pre-trained to reconstruct strong and robust language-aligned vision features via masked image modeling, is launched, demonstrating superior performance compared to prior state-of-the-art approaches across various representative vision tasks, while utilizing significantly fewer parameters and compute budgets.

Scaling Up Vision-Language Pretraining for Image Captioning

LEMON, a LargE-scale iMage captiONer, is presented, along with the first empirical study on the scaling behavior of VLP for image captioning, and it is shown that LEMON can generate captions with long-tail visual concepts when used in a zero-shot manner.

Data Determines Distributional Robustness in Contrastive Language Image Pre-training (CLIP)

The experiments show that a more diverse training distribution is the main cause of the robustness gains, with the other factors contributing little to no robustness.

Scaling Vision Transformers

A ViT model with two billion parameters is successfully trained, which attains a new state-of-the-art on ImageNet of 90.45% top-1 accuracy and performs well for few-shot transfer.

Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision

A simple dual-encoder architecture learns to align visual and language representations of the image and text pairs using a contrastive loss, and it is shown that the scale of the corpus can make up for its noise and leads to state-of-the-art representations even with such a simple learning scheme.
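The contrastive loss used by such dual-encoder models (ALIGN, CLIP) is the symmetric InfoNCE objective: matched image-text pairs sit on the diagonal of the batch similarity matrix and are treated as positives, all other pairs as negatives. A minimal NumPy sketch, assuming unit-normalized embeddings:

```python
import numpy as np

def clip_contrastive_loss(img_embs, txt_embs, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired, unit-normalized
    image/text embeddings. Pair i is (img_embs[i], txt_embs[i])."""
    n = len(img_embs)
    logits = img_embs @ txt_embs.T / temperature   # (n, n) similarity matrix
    idx = np.arange(n)                             # positives on the diagonal

    def xent(l):
        l = l - l.max(axis=1, keepdims=True)       # stabilize the softmax
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[idx, idx].mean()              # cross-entropy vs. diagonal

    # average of image->text and text->image directions
    return 0.5 * (xent(logits) + xent(logits.T))
```

Scale comes in because every batch element supplies negatives for every other: larger batches over a noisier but much bigger corpus still yield a strong training signal.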

PaLI: A Jointly-Scaled Multilingual Language-Image Model

PaLI achieves state-of-the-art in multiple vision and language tasks, while retaining a simple, modular, and scalable design.

EVA: Exploring the Limits of Masked Visual Representation Learning at Scale

Initializing the vision tower of a giant CLIP from EVA can greatly stabilize the training and outperform the from-scratch counterpart with much fewer samples and less compute, providing a new direction for scaling up and accelerating the costly training of multi-modal foundation models.

Language Models are Few-Shot Learners

GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic.

Big Transfer (BiT): General Visual Representation Learning

By combining a few carefully selected components, and transferring using a simple heuristic, Big Transfer achieves strong performance on over 20 datasets and performs well across a surprisingly wide range of data regimes -- from 1 example per class to 1M total examples.

Flamingo: a Visual Language Model for Few-Shot Learning

It is demonstrated that a single Flamingo model can achieve a new state of the art for few-shot learning, simply by prompting the model with task-specific examples.

Scaling Laws for Neural Language Models

Larger models are significantly more sample-efficient, such that optimally compute-efficient training involves training very large models on a relatively modest amount of data and stopping significantly before convergence.
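Scaling-law analyses like the one summarized above typically fit a power law, e.g. L(N) ≈ a · N^(-b), to observed loss as a function of model or data size, which becomes a linear fit in log-log space. A toy sketch of that fitting step (the actual paper also models irreducible loss terms and compute trade-offs):

```python
import numpy as np

def fit_power_law(n, loss):
    """Fit loss ~ a * n**(-b) by linear regression in log-log space.

    n:    array of sizes (e.g. parameter counts or dataset sizes).
    loss: array of observed losses at those sizes.
    Returns (a, b): log(loss) = log(a) - b * log(n).
    """
    slope, intercept = np.polyfit(np.log(n), np.log(loss), 1)
    return float(np.exp(intercept)), float(-slope)
```

Once fitted on small runs, such a law extrapolates expected loss at larger scales, which is how these papers choose model/data budgets before committing compute.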