Reproducible scaling laws for contrastive language-image learning

Mehdi Cherti, Romain Beaumont, Ross Wightman, Mitchell Wortsman, Gabriel Ilharco, Cade Gordon, Christoph Schuhmann, Ludwig Schmidt, Jenia Jitsev

Scaling up neural networks has led to remarkable performance across a wide range of tasks. Moreover, performance often follows reliable scaling laws as a function of training set size, model size, and compute, which offers valuable guidance as large-scale experiments are becoming increasingly expensive. However, previous work on scaling laws has primarily used private data & models or focused on uni-modal language or vision learning. To address these limitations, we investigate scaling laws for… 
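
The abstract is cut off before it gives any functional form, so the following is only a minimal sketch of how a scaling law of the commonly used power-law shape, error = a * compute^b, can be fitted by linear regression in log-log space. The function name `fit_power_law` and the data points are hypothetical and synthetic; they are not results or code from the paper.

```python
import math


def fit_power_law(xs, ys):
    """Fit y = a * x**b by ordinary least squares on (log x, log y).

    Assumes all xs and ys are strictly positive.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    # Slope of the regression line in log-log space is the exponent b.
    sxx = sum((u - mean_x) ** 2 for u in log_x)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(log_x, log_y))
    b = sxy / sxx
    # Intercept in log space is log(a).
    a = math.exp(mean_y - b * mean_x)
    return a, b


# Synthetic points generated from y = 2 * x**-0.5 (illustrative only).
compute = [1e9, 1e10, 1e11, 1e12]
error = [2.0 * c ** -0.5 for c in compute]
a, b = fit_power_law(compute, error)
```

With real measurements the points will not lie exactly on a line in log-log space, so the fitted exponent should be read together with the residuals rather than on its own.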

Scaling Laws for Generative Mixed-Modal Language Models

New mixed-modal scaling laws that unify the contributions of individual modalities and the interactions between them are reported, and the optimal synergy and competition due to data and model size are explicitly modeled as an additive term to previous uni-modal scaling laws.



Deep Learning Scaling is Predictable, Empirically

A large-scale empirical characterization of how generalization error and model size grow as training sets grow is presented, and it is shown that model size scales sublinearly with data size.

Scaling Up Vision-Language Pretraining for Image Captioning

LEMON, a LargE-scale iMage captiONer, is presented together with the first empirical study on the scaling behavior of VLP for image captioning, and it is shown that LEMON can generate captions with long-tail visual concepts when used in a zero-shot manner.

Data Determines Distributional Robustness in Contrastive Language Image Pre-training (CLIP)

The experiments show that the more diverse training distribution is the main cause for the robustness gains, with the other factors contributing little to no robustness.

A Large-scale Study of Representation Learning with the Visual Task Adaptation Benchmark

Representation learning promises to unlock deep learning for the long tail of vision tasks without expensive labelled datasets. Yet, the absence of a unified evaluation for general visual

LAION-5B: An open large-scale dataset for training next generation image-text models

This work presents LAION-5B, a dataset consisting of 5.85 billion CLIP-filtered image-text pairs, of which 2.32B contain English language text. It shows successful replication and fine-tuning of foundational models like CLIP, GLIDE and Stable Diffusion using the dataset, and discusses further experiments enabled by an openly available dataset of this scale.

Beyond neural scaling laws: beating power law scaling via data pruning

Overall, this work suggests that the discovery of good data-pruning metrics may provide a viable path forward to substantially improved neural scaling laws, thereby reducing the resource costs of modern deep learning.

Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision

A simple dual-encoder architecture learns to align visual and language representations of the image and text pairs using a contrastive loss, and it is shown that the scale of the corpus can make up for its noise and leads to state-of-the-art representations even with such a simple learning scheme.
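
As a rough illustration of the contrastive objective that a dual-encoder setup like this relies on, here is a minimal sketch of a symmetric (image-to-text and text-to-image) softmax contrastive loss over a batch of already-computed embeddings. The function names, the fixed `temperature=0.07`, and the toy embeddings are assumptions made for this example and are not taken from the paper, whose encoders and training details are not described here.

```python
import math


def _normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def _log_softmax_at(row, idx):
    """Log-probability of entry idx under a softmax over row (numerically stable)."""
    m = max(row)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
    return row[idx] - log_sum_exp


def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss; pair k of images and texts is the positive."""
    imgs = [_normalize(v) for v in image_emb]
    txts = [_normalize(v) for v in text_emb]
    n = len(imgs)
    # Cosine similarity matrix scaled by temperature: rows are images, columns are texts.
    logits = [
        [sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
        for i in imgs
    ]
    # Image-to-text: each row should put its mass on the diagonal entry.
    i2t = -sum(_log_softmax_at(logits[k], k) for k in range(n)) / n
    # Text-to-image: same, but over columns.
    t2i = -sum(
        _log_softmax_at([logits[r][k] for r in range(n)], k) for k in range(n)
    ) / n
    return 0.5 * (i2t + t2i)


# Toy 2-D embeddings (hypothetical): matched pairs vs. swapped pairs.
aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
uniform = contrastive_loss([[1.0, 0.0], [1.0, 0.0]], [[1.0, 0.0], [1.0, 0.0]])
```

The loss is low when each image embedding is closest to its own caption and high when the pairs are swapped; when all embeddings are identical, it reduces to the log of the batch size.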

PaLI: A Jointly-Scaled Multilingual Language-Image Model

PaLI achieves state-of-the-art in multiple vision and language tasks, while retaining a simple, modular, and scalable design.

EVA: Exploring the Limits of Masked Visual Representation Learning at Scale

Initializing the vision tower of a giant CLIP from EVA can greatly stabilize the training and outperform the training-from-scratch counterpart with far fewer samples and less compute, providing a new direction for scaling up and accelerating the costly training of multi-modal foundation models.

Language Models are Few-Shot Learners

GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic.