Scaling Laws for Acoustic Models

@inproceedings{Droppo2021ScalingLF,
  title={Scaling Laws for Acoustic Models},
  author={Jasha Droppo and Oguz H. Elibol},
  booktitle={Interspeech},
  year={2021}
}
There is a recent trend in machine learning to increase model quality by growing models to sizes previously thought to be unreasonable. Recent work has shown that autoregressive generative models with cross-entropy objective functions exhibit smooth power-law relationships, or scaling laws, that predict model quality from model size, training set size, and the available compute budget. These scaling laws allow one to choose nearly optimal hyper-parameters given constraints on available training… 
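
As an illustration of the functional form these laws take (the notation below follows the language-model scaling-law literature; the exact parameterization and fitted constants in this paper may differ), the validation loss L is modeled as a power law in model size N, dataset size D, or compute budget C:

L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad
L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad
L(C) = \left(\frac{C_c}{C}\right)^{\alpha_C}

where N_c, D_c, C_c and the exponents \alpha_N, \alpha_D, \alpha_C are constants fitted to measured training runs.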

Citations of this paper

Scaling Laws Under the Microscope: Predicting Transformer Performance from Small Scale Experiments

TLDR
This work finds that scaling laws emerge at finetuning time in some NLP tasks, and that they can also be exploited for debugging convergence when training large models.

Scaling ASR Improves Zero and Few Shot Learning

TLDR
By training 1-10B parameter universal English ASR models, this work pushes the limits of speech recognition performance across many domains and proposes data selection techniques to efficiently scale training data to find the most valuable samples in massive datasets.

Scaling Laws and Interpretability of Learning from Repeated Data

TLDR
It is shown that data repetition disproportionately damages copying and internal structures associated with generalization, such as induction heads, providing a possible mechanism for the shift from generalization to memorization.

Predictability and Surprise in Large Generative Models

TLDR
This paper highlights a counterintuitive property of large-scale generative models: a paradoxical combination of predictable loss on a broad training distribution and unpredictable specific capabilities, inputs, and outputs. It analyzes how these conflicting properties combine to give model developers various motivations for deploying these models, along with challenges that can hinder deployment.

The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models

TLDR
Instances of phase transitions are identified: capability thresholds at which the agent’s behavior qualitatively shifts, leading to a sharp decrease in the true reward; to address this, an anomaly detection task for aberrant policies is proposed and several baseline detectors are offered.

References

Showing 10 of 13 references

Scaling Laws for Autoregressive Generative Modeling

TLDR
Empirical scaling laws for the cross-entropy loss are identified, strengthening the case that scaling laws have important implications for neural network performance, including on downstream tasks.

Scaling Laws for Neural Language Models

TLDR
Larger models are significantly more sample-efficient, such that optimally compute-efficient training involves training very large models on a relatively modest amount of data and stopping significantly before convergence.
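
To make the "choose hyper-parameters before training" use of these laws concrete, here is a small sketch (with made-up loss values, not numbers from either paper) that fits L(N) = (N_c / N)^alpha in log-log space and extrapolates the loss of a larger model:

import numpy as np

# hypothetical (parameter count, validation loss) pairs, for illustration only
sizes = np.array([1e6, 3e6, 1e7, 3e7, 1e8])
losses = np.array([4.1, 3.6, 3.1, 2.75, 2.4])

# a power law L(N) = (N_c / N) ** alpha is a straight line in log-log space,
# so fit log(loss) against log(size) and read the exponent off the slope
slope, intercept = np.polyfit(np.log(sizes), np.log(losses), 1)
alpha, n_c = -slope, np.exp(-intercept / slope)

print(f"alpha = {alpha:.3f}")
print(f"extrapolated loss at 1e9 parameters: {(n_c / 1e9) ** alpha:.2f}")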

Generative Pre-Training for Speech with Autoregressive Predictive Coding

  • Yu-An Chung, James R. Glass
  • ICASSP 2020
TLDR
This paper proposes to use autoregressive predictive coding (APC), a recently proposed self-supervised objective, as a generative pre-training approach for learning meaningful, non-specific, and transferable speech representations.
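
A minimal sketch of the APC idea described above: an autoregressive model reads past frames and predicts a frame a few steps ahead, trained with an L1 loss (the layer sizes and 3-frame shift below are arbitrary choices, not the paper's configuration):

import torch
import torch.nn as nn

class APCModel(nn.Module):
    """Autoregressive predictor over log-mel frames (sizes are illustrative)."""
    def __init__(self, n_mels=80, hidden=512):
        super().__init__()
        self.rnn = nn.GRU(n_mels, hidden, num_layers=3, batch_first=True)
        self.proj = nn.Linear(hidden, n_mels)

    def forward(self, x):                # x: (batch, time, n_mels)
        h, _ = self.rnn(x)
        return self.proj(h)              # one predicted frame per input step

def apc_loss(model, feats, shift=3):
    """Predict the frame `shift` steps ahead of each position."""
    preds = model(feats[:, :-shift])     # predictions from each prefix
    targets = feats[:, shift:]           # the frames `shift` steps in the future
    return nn.functional.l1_loss(preds, targets)

# usage with random features standing in for log-mel filterbanks
model = APCModel()
loss = apc_loss(model, torch.randn(4, 200, 80))
loss.backward()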

DeCoAR 2.0: Deep Contextualized Acoustic Representations with Vector Quantization

TLDR
This work proposes DeCoAR 2.0, a Deep Contextualized Acoustic Representation with vector quantization, which uses Transformers in the encoding module instead of LSTMs and trains speech representations with an objective that combines a reconstruction loss with a vector-quantization diversity loss.
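
The "diversity" part of that objective can be sketched as a normalized-entropy penalty on average codebook usage; this is one common formulation (following wav2vec 2.0-style diversity terms) and may differ in detail from DeCoAR 2.0's exact loss:

import torch

def codebook_diversity_loss(code_logits):
    # code_logits: (frames, codebook_size) scores over codebook entries.
    # Push the average code distribution toward uniform so that all
    # codebook entries are actually used.
    avg_probs = torch.softmax(code_logits, dim=-1).mean(dim=0)
    entropy = -(avg_probs * torch.log(avg_probs + 1e-7)).sum()
    max_entropy = torch.log(torch.tensor(float(code_logits.size(-1))))
    return 1.0 - entropy / max_entropy   # 0 when usage is uniform, near 1 when collapsed

loss = codebook_diversity_loss(torch.randn(100, 320))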

Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity

TLDR
This work simplifies the MoE routing algorithm, designs intuitive improved models with reduced communication and computational costs, advances the current scale of language models by pre-training models of up to a trillion parameters on the “Colossal Clean Crawled Corpus”, and achieves a 4x speedup over the T5-XXL model.
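
A minimal sketch of the top-1 ("switch") routing idea from the summary above, leaving out the capacity-factor and load-balancing details; dimensions and expert count are arbitrary:

import torch
import torch.nn as nn
import torch.nn.functional as F

class SwitchFFN(nn.Module):
    """Each token is routed to exactly one expert feed-forward network."""
    def __init__(self, d_model=256, d_ff=1024, n_experts=4):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                              # x: (tokens, d_model)
        gate, idx = F.softmax(self.router(x), dim=-1).max(dim=-1)
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):      # dispatch each token group
            sel = idx == e
            if sel.any():
                out[sel] = gate[sel].unsqueeze(-1) * expert(x[sel])
        return out

y = SwitchFFN()(torch.randn(32, 256))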

Representation Learning with Contrastive Predictive Coding

TLDR
This work proposes a universal unsupervised learning approach to extract useful representations from high-dimensional data, which it calls Contrastive Predictive Coding, and demonstrates that the approach is able to learn useful representations achieving strong performance on four distinct domains: speech, images, text and reinforcement learning in 3D environments.
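
The contrastive objective can be sketched as an InfoNCE-style classification problem in which each context vector must identify its own future latent among the other items in the batch (dimensions below are arbitrary):

import torch
import torch.nn.functional as F

def info_nce_loss(context, future):
    # context, future: (batch, dim) projections of a context vector and of the
    # latent it should predict; other batch items serve as negatives.
    logits = context @ future.t()               # pairwise similarity scores
    targets = torch.arange(context.size(0))     # the diagonal pairs are positives
    return F.cross_entropy(logits, targets)

loss = info_nce_loss(torch.randn(16, 128), torch.randn(16, 128))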

wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations

We show for the first time that learning powerful representations from speech audio alone followed by fine-tuning on transcribed speech can outperform the best semi-supervised methods while being conceptually simpler.

Deep Contextualized Acoustic Representations for Semi-Supervised Speech Recognition

TLDR
This work first exploits a large amount of unlabeled audio data via representation learning, reconstructing a temporal slice of filterbank features from past and future context frames, and then trains a CTC-based end-to-end ASR system using a smaller amount of labeled audio data.

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

TLDR
A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
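
The masked pre-training step described above can be sketched as follows (simplified: the paper also leaves some selected tokens unchanged or swaps in random tokens instead of always using the mask id):

import torch

def mask_for_mlm(token_ids, mask_token_id, mask_prob=0.15):
    # token_ids: (batch, seq_len) integer ids. Hide a random subset of positions;
    # the model must recover them from both left and right context, and the
    # cross-entropy loss is computed only at the masked positions.
    mask = torch.rand(token_ids.shape) < mask_prob
    corrupted = token_ids.clone()
    corrupted[mask] = mask_token_id
    return corrupted, mask

corrupted, mask = mask_for_mlm(torch.randint(0, 30000, (2, 16)), mask_token_id=103)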

Self-Training and Pre-Training are Complementary for Speech Recognition

TLDR
Pseudo-labeling and pre-training with wav2vec 2.0 are shown to be complementary in a variety of labeled data setups for improving speech recognition systems using unlabeled data.
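
The complementarity result concerns a standard self-training loop, which can be sketched schematically as below; `transcribe` and `train` are hypothetical placeholders, not the paper's actual decoding or training pipeline:

def self_training_round(model, labeled_pairs, unlabeled_audio, transcribe, train):
    # The current (possibly pre-trained) model labels the unlabeled audio, and the
    # resulting pseudo-labeled pairs are mixed with the labeled set for another
    # supervised training pass.
    pseudo_pairs = [(audio, transcribe(model, audio)) for audio in unlabeled_audio]
    return train(model, labeled_pairs + pseudo_pairs)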