Corpus ID: 237532682

Scaling Laws for Neural Machine Translation

@article{Ghorbani2021ScalingLF,
  title={Scaling Laws for Neural Machine Translation},
  author={B. Ghorbani and Orhan Firat and Markus Freitag and Ankur Bapna and Maxim Krikun and Xavier Garc{\'i}a and Ciprian Chelba and Colin Cherry},
  journal={ArXiv},
  year={2021},
  volume={abs/2109.07740}
}
We present an empirical study of scaling properties of encoder-decoder Transformer models used in neural machine translation (NMT). We show that cross-entropy loss as a function of model size follows a certain scaling law. Specifically, (i) we propose a formula which describes the scaling behavior of cross-entropy loss as a bivariate function of encoder and decoder size, and show that it gives accurate predictions under a variety of scaling approaches and languages; we show that the total number…
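As a rough illustration of the kind of bivariate fit the abstract describes (a minimal sketch; the sizes, losses, and exponents below are hypothetical, and the paper's exact parameterisation may differ), one can fit a power law in encoder and decoder size with an irreducible-loss term using SciPy:

```python
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(sizes, alpha, p_e, p_d, l_inf):
    """Bivariate power law: L(N_e, N_d) = alpha * N_e**(-p_e) * N_d**(-p_d) + l_inf.

    N_e and N_d are encoder/decoder parameter counts, expressed here in units
    of an arbitrary reference size so that the fit is well conditioned.
    """
    n_e, n_d = sizes
    return alpha * n_e ** (-p_e) * n_d ** (-p_d) + l_inf

# Hypothetical (encoder, decoder) sizes and dev-set losses; the losses are
# generated from the same functional form purely so the example fit converges.
n_e = np.array([0.5, 1.0, 2.0, 4.0, 1.0, 1.0, 2.0])
n_d = np.array([1.0, 1.0, 1.0, 1.0, 2.0, 4.0, 2.0])
loss = scaling_law((n_e, n_d), 1.0, 0.35, 0.25, 1.2)

fitted, _ = curve_fit(scaling_law, (n_e, n_d), loss, p0=(1.0, 0.3, 0.3, 1.0))
print(dict(zip(["alpha", "p_e", "p_d", "l_inf"], fitted.round(3))))
```

In such a fit, the two exponents would quantify how much growing the encoder versus the decoder contributes to reducing the loss, which is the kind of question the abstract raises about total parameter count alone being insufficient.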

Citations

Unsupervised Neural Machine Translation with Generative Language Models Only
TLDR: By using GPT-3’s zero-shot translation capability, this method achieves a new state of the art in unsupervised translation on the WMT14 English-French benchmark, attaining a BLEU score of 42.1.

References

Showing 1-10 of 43 references
Scaling Laws for Transfer
TLDR: This work finds that pre-training effectively multiplies the fine-tuning dataset size, and suggests that the exponents in these power laws correspond to measures of a model's generality and of the proximity of distributions (in a directed rather than symmetric sense).
Deep Learning Scaling is Predictable, Empirically
TLDR: A large-scale empirical characterization of generalization error and model-size growth as training sets grow is presented, showing that model size scales sublinearly with data size.
Scaling Laws for Neural Language Models
TLDR: Larger models are significantly more sample-efficient, such that optimally compute-efficient training involves training very large models on a relatively modest amount of data and stopping significantly before convergence.
Improving Neural Machine Translation Models with Monolingual Data
TLDR: This work pairs monolingual training data with an automatic back-translation and treats it as additional parallel training data, obtaining substantial improvements on the WMT 15 English<->German task and on the low-resource IWSLT 14 Turkish->English task.
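The entry above describes back-translation only at a high level; the sketch below is a minimal, framework-agnostic illustration of the idea (the reverse_translate callable is a hypothetical target-to-source model, not an API from the paper):

```python
from typing import Callable, Iterable, List, Tuple

def back_translate(
    target_monolingual: Iterable[str],
    reverse_translate: Callable[[str], str],  # hypothetical target->source model
) -> List[Tuple[str, str]]:
    """Create synthetic parallel data in the spirit of back-translation.

    Each monolingual target-language sentence is machine-translated back into
    the source language; the resulting (synthetic source, real target) pair is
    then mixed into the genuine parallel data used to train the forward model.
    """
    return [(reverse_translate(tgt), tgt) for tgt in target_monolingual]

# Usage (hypothetical names): for an English->German system, pass monolingual
# German text and a German->English model, then concatenate with real pairs:
#   train_pairs = real_pairs + back_translate(mono_german, de_to_en_translate)
```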
Explaining Neural Scaling Laws
TLDR: This work identifies variance-limited and resolution-limited scaling behavior for both dataset and model size, yielding four related scaling regimes with respect to the number of model parameters P and the dataset size D.
Data and Parameter Scaling Laws for Neural Machine Translation
We observe that the development cross-entropy loss of supervised neural machine translation models scales like a power law with the amount of training data and the number of…
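For orientation only, a generic joint power law of the kind this snippet describes might be written as follows (an illustrative form with assumed symbols N for model parameters and D for training-data size, not the paper's exact equation):

```latex
% Illustrative only: development loss as a joint power law in model size N
% and training-data size D, with an irreducible-loss term L_\infty.
L(N, D) \;\approx\; \alpha\, N^{-p_N} \;+\; \beta\, D^{-p_D} \;+\; L_{\infty}
```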
Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation
TLDR: GNMT, Google's Neural Machine Translation system, is presented, which attempts to address many of the weaknesses of conventional phrase-based translation systems and provides a good balance between the flexibility of "character"-delimited models and the efficiency of "word"-delimited models.
Sequence-Level Knowledge Distillation
TLDR: It is demonstrated that standard knowledge distillation applied to word-level prediction can be effective for NMT, and two novel sequence-level versions of knowledge distillation are introduced that further improve performance and, somewhat surprisingly, seem to eliminate the need for beam search.
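As a rough sketch of the sequence-level variant mentioned above (the teacher_beam_translate callable is a hypothetical stand-in for running beam search with a trained teacher model; it is not an interface from the paper):

```python
from typing import Callable, Iterable, List, Tuple

def sequence_level_distillation_data(
    sources: Iterable[str],
    teacher_beam_translate: Callable[[str], str],  # hypothetical: teacher + beam search
) -> List[Tuple[str, str]]:
    """Build a distilled training set for sequence-level knowledge distillation.

    The teacher's beam-search output replaces the reference translation, and the
    student is then trained on (source, teacher output) pairs with the ordinary
    cross-entropy objective; the more deterministic targets are what allow the
    student to often match beam-search quality with greedy decoding.
    """
    return [(src, teacher_beam_translate(src)) for src in sources]
```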
Statistical Power and Translationese in Machine Translation Evaluation
TLDR: Detailed analysis of the potential adverse effects of translationese on machine translation evaluation shows differences between the conclusions drawn from evaluations that include translationese in the test data and those drawn from experiments that tested only on text originally composed in that language.
Scaling Laws for Autoregressive Generative Modeling
TLDR: Empirical scaling laws for the cross-entropy loss are identified, strengthening the case that scaling laws have important implications for neural network performance, including on downstream tasks.