Scaling Laws for Generative Mixed-Modal Language Models
@article{Aghajanyan2023ScalingLF,
  title={Scaling Laws for Generative Mixed-Modal Language Models},
  author={Armen Aghajanyan and L. Yu and Alexis Conneau and Wei-Ning Hsu and Karen Hambardzumyan and Susan Zhang and Stephen Roller and Naman Goyal and Omer Levy and Luke Zettlemoyer},
  journal={ArXiv},
  year={2023},
  volume={abs/2301.03728}
}
Generative language models define distributions over sequences of tokens that can represent essentially any combination of data modalities (e.g., any permutation of image tokens from VQ-VAEs, speech tokens from HuBERT, BPE tokens for language or code, and so on). To better understand the scaling properties of such mixed-modal models, we conducted over 250 experiments using seven different modalities and model sizes ranging from 8 million to 30 billion parameters, trained on 5-100 billion tokens. We report…
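For background, scaling-law studies of this kind fit a parametric power-law form to the validation loss as a function of model and data size. The sketch below shows the standard unimodal form from prior work (Hoffmann et al., 2022) rather than the paper's own mixed-modal law; N, D, and the fitted constants are as used in that line of work.

```latex
% Standard unimodal scaling-law form (Hoffmann et al., 2022), shown as
% background only; the paper extends this kind of law to mixed-modal data.
% N = parameter count, D = training tokens, E, A, B, \alpha, \beta = fitted constants.
\[
  L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
\]
```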
One Citation
Exploring AI Ethics of ChatGPT: A Diagnostic Analysis
- Computer Science, ArXiv
- 2023
A qualitative study of OpenAI’s ChatGPT is performed to better understand the practical features of ethical risks in recent LLMs; it is found that a significant number of these risks cannot be addressed by existing benchmarks, and they are therefore illustrated via additional case studies.
References
Showing 1-10 of 41 references
Reproducible scaling laws for contrastive language-image learning
- Computer Science, ArXiv
- 2022
It is found that the training distribution plays a key role in scaling laws as the OpenAI and OpenCLIP models exhibit different scaling behavior despite identical model architectures and similar training recipes.
Scaling Laws for Neural Machine Translation
- Computer Science, ICLR
- 2022
A formula is proposed which describes the scaling behavior of cross-entropy loss as a bivariate function of encoder and decoder size, and it is shown that it gives accurate predictions under a variety of scaling approaches and languages.
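As a rough illustration of such a bivariate law (not necessarily the paper's exact parameterization), the cross-entropy loss can be modeled as a product of power laws in the encoder and decoder sizes plus an irreducible term:

```latex
% Illustrative bivariate power-law form for encoder-decoder scaling; the
% paper's exact parameterization may differ. N_e, N_d = encoder/decoder
% parameters; \bar{N}_e, \bar{N}_d = reference sizes; L_\infty = irreducible loss.
\[
  L(N_e, N_d) = \alpha \left(\frac{\bar{N}_e}{N_e}\right)^{p_e}
                       \left(\frac{\bar{N}_d}{N_d}\right)^{p_d} + L_{\infty}
\]
```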
Efficient Training of Language Models to Fill in the Middle
- Computer Science, ArXiv
- 2022
There is extensive evidence that training models with a large fraction of data transformed by the fill-in-the-middle reordering does not harm the original left-to-right generative capability, as measured by perplexity and sampling evaluations across a wide range of scales.
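A minimal sketch of a fill-in-the-middle style transformation, assuming hypothetical <PRE>/<SUF>/<MID> sentinel strings and a uniform span split; the paper's actual special tokens and span-sampling scheme may differ:

```python
import random

def fim_transform(text: str, rng: random.Random) -> str:
    """Rearrange a document so a left-to-right model predicts the middle last.

    Sketch of the fill-in-the-middle idea: split the text into
    (prefix, middle, suffix) and emit prefix and suffix before the middle,
    so the model conditions on both sides of the gap when generating it.
    The sentinel strings below are illustrative placeholders.
    """
    if len(text) < 3:
        return text  # too short to split into three non-empty pieces
    i, j = sorted(rng.sample(range(1, len(text)), 2))
    prefix, middle, suffix = text[:i], text[i:j], text[j:]
    return f"<PRE>{prefix}<SUF>{suffix}<MID>{middle}"

rng = random.Random(0)
print(fim_transform("def add(a, b):\n    return a + b\n", rng))
```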
Training Compute-Optimal Large Language Models
- Computer Science, ArXiv
- 2022
This paper trains a predicted compute-optimal model, Chinchilla, that uses the same compute budget as Gopher but with 70B parameters and 4× more data, and reaches a state-of-the-art average accuracy on the MMLU benchmark.
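For reference, the compute-optimal recipe behind Chinchilla follows from minimizing a fitted parametric loss L(N, D) under a fixed compute budget; a sketch of the resulting allocation rule, with approximate exponents:

```latex
% Sketch of the compute-optimal allocation: with training compute
% C \approx 6 N D, minimizing the fitted loss L(N, D) under the budget
% yields power-law optima; Chinchilla's headline finding is that the
% exponents are roughly equal, so parameters and tokens scale together.
\[
  C \approx 6 N D, \qquad
  N_{\mathrm{opt}}(C) \propto C^{a}, \qquad
  D_{\mathrm{opt}}(C) \propto C^{b}, \qquad a \approx b \approx 0.5
\]
```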
Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework
- Computer Science, ICML
- 2022
Despite its simplicity and relatively small-scale training data, OFA achieves new SOTAs on a series of cross-modal tasks while attaining highly competitive performance on unimodal tasks.
BARTSmiles: Generative Masked Language Models for Molecular Representations
- Computer Science, ArXiv
- 2022
A robust self-supervised strategy for molecular representations with generative masked language models is identified through a series of in-depth ablations, and it is shown quantitatively that, when applied to the molecular domain, the BART objective learns representations that implicitly encode the downstream tasks of interest.
Scaling Laws for Acoustic Models
- Computer Science, Interspeech
- 2021
This paper demonstrates that acoustic models trained with an auto-predictive coding loss behave as if they are subject to similar scaling laws, and finds that the scaling laws accurately match model performance over two orders of magnitude in both model size and training set size.
CM3: A Causal Masked Multimodal Model of the Internet
- Computer Science, ArXiv
- 2022
The causal masking objective provides a hybrid of the more common causal and masked language modeling objectives, enabling full generative modeling while also providing bidirectional context when generating the masked spans.
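A minimal sketch of a causally masked transformation as described here, assuming a hypothetical <mask:0> sentinel string and a single relocated span; CM3's actual tokenization and span selection differ in detail:

```python
import random

def causal_mask_transform(tokens: list[str], rng: random.Random) -> list[str]:
    """Move one contiguous span to the end of the sequence behind a sentinel.

    Sketch of the causally masked objective: the model still trains
    left-to-right on the rearranged sequence, but by the time it generates
    the relocated span it has already seen the tokens on both sides of the
    original gap, i.e. it gets bidirectional context for that span.
    """
    if len(tokens) < 3:
        return tokens
    i, j = sorted(rng.sample(range(1, len(tokens)), 2))
    span = tokens[i:j]
    return tokens[:i] + ["<mask:0>"] + tokens[j:] + ["<mask:0>"] + span

rng = random.Random(0)
print(causal_mask_transform("a photo of a cat sitting on a mat".split(), rng))
```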
Language Models are Few-Shot Learners
- Computer Science, NeurIPS
- 2020
GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic.
MERLOT RESERVE: Neural Script Knowledge through Vision and Language and Sound
- Computer Science, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
- 2022
This work introduces MERLOT RESERVE, a model that jointly represents videos over time through a new training objective that learns from audio, subtitles, and video frames, and obtains competitive results on four video tasks, even outperforming supervised approaches on the recently proposed Situated Reasoning (STAR) benchmark.