Scaling Laws for Generative Mixed-Modal Language Models

Armen Aghajanyan, L. Yu, Alexis Conneau, Wei-Ning Hsu, Karen Hambardzumyan, Susan Zhang, Stephen Roller, Naman Goyal, Omer Levy, Luke Zettlemoyer

Generative language models define distributions over sequences of tokens that can represent essentially any combination of data modalities (e.g., any permutation of image tokens from VQ-VAEs, speech tokens from HuBERT, BPE tokens for language or code, and so on). To better understand the scaling properties of such mixed-modal models, we conducted over 250 experiments using seven different modalities and model sizes ranging from 8 million to 30 billion parameters, trained on 5-100 billion tokens. We report…

Exploring AI Ethics of ChatGPT: A Diagnostic Analysis

A qualitative study of OpenAI's ChatGPT is performed to better understand the practical characteristics of ethical risks in recent LLMs; it finds that a significant number of ethical risks cannot be addressed by existing benchmarks, and illustrates them via additional case studies.

Reproducible scaling laws for contrastive language-image learning

It is found that the training distribution plays a key role in scaling laws as the OpenAI and OpenCLIP models exhibit different scaling behavior despite identical model architectures and similar training recipes.

Scaling Laws for Neural Machine Translation

A formula is proposed which describes the scaling behavior of cross-entropy loss as a bivariate function of encoder and decoder size, and it is shown that it gives accurate predictions under a variety of scaling approaches and languages.
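
As an illustration of what such a bivariate scaling law can look like (the symbols and the exact functional form below are assumptions for exposition, not the paper's reported fit), cross-entropy loss is often modeled as a power law in both component sizes:

L(N_e, N_d) = \alpha \, N_e^{-p_e} \, N_d^{-p_d} + L_\infty

where N_e and N_d are the encoder and decoder parameter counts, p_e and p_d are fitted exponents, and L_\infty is the irreducible loss floor.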

Efficient Training of Language Models to Fill in the Middle

There is extensive evidence that training models with a large fraction of data transformed in this way does not harm the original left-to-right generative capability, as measured by perplexity and sampling evaluations across a wide range of scales.
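
A minimal sketch of the underlying data transformation (the sentinel token names and the character-level splitting here are illustrative assumptions; the actual method operates on tokenized documents):

```python
import random

SENTINEL_PRE = "<PRE>"
SENTINEL_SUF = "<SUF>"
SENTINEL_MID = "<MID>"

def fim_transform(text: str, rng: random.Random) -> str:
    """Split a document into (prefix, middle, suffix) at two random cut
    points and reorder it as prefix + suffix + middle, so that a
    left-to-right model learns to infill the middle given both sides."""
    a, b = sorted(rng.sample(range(len(text) + 1), 2))
    prefix, middle, suffix = text[:a], text[a:b], text[b:]
    return f"{SENTINEL_PRE}{prefix}{SENTINEL_SUF}{suffix}{SENTINEL_MID}{middle}"

rng = random.Random(0)
example = fim_transform("def add(a, b): return a + b", rng)
```

Because the transform only reorders spans, the original document is always recoverable, which is consistent with the claim that training on such data need not harm left-to-right modeling.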

Training Compute-Optimal Large Language Models

This paper trains a predicted compute-optimal model, Chinchilla, that uses the same compute budget as Gopher but with 70B parameters and 4× more data, and reaches a state-of-the-art average accuracy on the MMLU benchmark.
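
The paper's headline finding, that parameters and training tokens should be scaled roughly equally with compute, combined with the common approximation C ≈ 6·N·D, yields a quick estimator (the 20-tokens-per-parameter ratio below is an approximate rule of thumb often quoted for this result, not the paper's exact fitted value):

```python
def compute_optimal(budget_flops: float, tokens_per_param: float = 20.0):
    """Estimate a compute-optimal (parameters, tokens) pair for a FLOP
    budget, using C ~ 6*N*D and the rough heuristic D ~ 20*N.
    Both constants are approximations, not exact fitted values."""
    n = (budget_flops / (6.0 * tokens_per_param)) ** 0.5  # solve 6*N*(20*N) = C
    return n, tokens_per_param * n

# A Gopher-scale budget of ~5.76e23 FLOPs gives roughly 70B parameters
# and 1.4T tokens, consistent with Chinchilla's reported configuration.
params, tokens = compute_optimal(5.76e23)
```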

Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework

Despite its simplicity and relatively small-scale training data, OFA achieves new SOTAs in a series of cross-modal tasks while attaining highly competitive performances on unimodal tasks.

BARTSmiles: Generative Masked Language Models for Molecular Representations

A robust self-supervised strategy tailored towards molecular representations for generative masked language models is discovered through a series of tailored, in-depth ablations, and it is quantitatively shown that when applied to the molecular domain, the BART objective learns representations that implicitly encode the authors' downstream tasks of interest.

Scaling Laws for Acoustic Models

This paper demonstrates that acoustic models trained with an auto-predictive coding loss behave as if they are subject to similar scaling laws, and finds that the scaling laws accurately match model performance over two orders of magnitude in both model size and training set size.

CM3: A Causal Masked Multimodal Model of the Internet

The causal masking objective provides a hybrid of the more common causal and masked language models, enabling full generative modeling while also providing bidirectional context when generating the masked spans.
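
A token-level sketch of the idea (a single masked span and the sentinel name are illustrative assumptions; the paper's objective handles multiple spans over documents):

```python
import random

def causal_mask(tokens: list, rng: random.Random) -> list:
    """Remove one contiguous span, replace it with a <mask> sentinel, and
    append the span after a second sentinel at the end. A causal decoder
    then sees both the text before AND after the mask (bidirectional
    context) before it generates the span itself."""
    a, b = sorted(rng.sample(range(len(tokens) + 1), 2))
    return tokens[:a] + ["<mask>"] + tokens[b:] + ["<mask>"] + tokens[a:b]

rng = random.Random(0)
example = causal_mask(list("abcdef"), rng)
```

Unmasked documents remain ordinary left-to-right sequences, which is why the objective is described as a hybrid: it retains full causal generation while teaching infilling.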

Language Models are Few-Shot Learners

GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic.

MERLOT RESERVE: Neural Script Knowledge through Vision and Language and Sound

This work introduces MERLOT RESERVE, a model that represents videos jointly over time through a new training objective that learns from audio, subtitles, and video frames, and obtains competitive results on four video tasks, even outperforming supervised approaches on the recently proposed Situated Reasoning (STAR) benchmark.