Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer

Greg Yang, Edward J. Hu, Igor Babuschkin, Szymon Sidor, Xiaodong Liu, David Farhi, Nick Ryder, Jakub W. Pachocki, Weizhu Chen, Jianfeng Gao
Hyperparameter (HP) tuning in deep learning is an expensive process, prohibitively so for neural networks (NNs) with billions of parameters. We show that, in the recently discovered Maximal Update Parametrization (µP), many optimal HPs remain stable even as model size changes. This leads to a new HP tuning paradigm we call µTransfer: parametrize the target model in µP, tune the HPs indirectly on a smaller model, and zero-shot transfer them to the full-sized model, i.e., without directly…
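The transfer step can be sketched in a few lines. This is a toy illustration assuming the commonly cited µP rule that, under Adam, learning rates for hidden weight matrices scale inversely with width; `mup_adam_lr` and the widths are illustrative names and values, not the paper's released API.

```python
def mup_adam_lr(base_lr, base_width, width):
    """Illustrative muP transfer rule for Adam on hidden weight matrices:
    a learning rate tuned at base_width is rescaled by base_width / width,
    so a wider target model gets a proportionally smaller LR."""
    return base_lr * base_width / width

# Tune on a small proxy model (width 256), then transfer zero-shot
# to the full-sized model (width 8192) without re-tuning:
small_lr = 1e-3                                          # found by sweeping the proxy
large_lr = mup_adam_lr(small_lr, base_width=256, width=8192)
```

At equal widths the rule is the identity, so the proxy model itself is unchanged by the rescaling.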
The Neural Covariance SDE: Shaped Infinite Depth-and-Width Networks at Initialization
This work identifies the precise scaling of the activation function necessary to arrive at a non-trivial limit, and shows that the random covariance matrix is governed by a stochastic differential equation (SDE), which the authors call the Neural Covariance SDE.
Model-Parallel Task Parallelism for Efficient Multi-Large-Model Deep Learning
HYDRA decouples the scalability of model parameters from the parallelism of execution, enabling DL users to train even a 6-billion-parameter model on a single commodity GPU, and fully exploits the higher speedup potential offered by task parallelism in a multi-GPU setup, yielding near-linear strong scaling and, in turn, making rigorous model selection more practical for such models.
Dataset Distillation using Neural Feature Regression
The proposed algorithm is analogous to truncated backpropagation through time with a pool of models to alleviate various types of overfitting in dataset distillation and outperforms the previous methods on CIFAR100, Tiny ImageNet, and ImageNet-1K.
A Case of Exponential Convergence Rates for SVM
A simple mechanism to obtain fast convergence rates and its usage for SVM is presented and it is shown that SVM can exhibit exponential convergence rates even without assuming the hard Tsybakov margin condition.
Self-Consistent Dynamical Field Theory of Kernel Evolution in Wide Neural Networks
Comparisons of the self-consistent solution to various approximation schemes, including the static NTK approximation, the gradient independence assumption, and leading-order perturbation theory, show that each of these approximations can break down in regimes where the general self-consistent solution still provides an accurate description.
METRO: Efficient Denoising Pretraining of Large Scale Autoencoding Language Models with Model Generated Signals
This work conducts a comprehensive empirical study, and proposes a recipe, namely “Model generated dEnoising TRaining Objective” (METRO), which incorporates some of the best modeling techniques developed recently to speed up, stabilize, and enhance pretrained language models without compromising model effectiveness.
How Do Graph Networks Generalize to Large and Diverse Molecular Systems?
The GemNet-OC model is proposed, which outperforms the previous state-of-the-art on OC20 by 16% while reducing training time by a factor of 10; the results challenge the common belief that graph neural networks work equally well independent of dataset size and diversity.
Data-Centric Green AI: An Exploratory Empirical Study
Evidence is shown that, by modifying only the datasets, energy consumption can be drastically reduced, often at the cost of a negligible (or even no) accuracy decline, which calls for a research agenda focused on data-centric techniques.
Training Compute-Optimal Large Language Models
This paper trains a predicted compute-optimal model, Chinchilla, that uses the same compute budget as Gopher but with 70B parameters and 4× more data, and reaches a state-of-the-art average accuracy on the MMLU benchmark.
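This finding is often distilled into the heuristic that compute-optimal training uses roughly 20 tokens per parameter. The helper below is a sketch of that rule of thumb only, not the paper's fitted scaling law; `chinchilla_tokens` is an illustrative name.

```python
def chinchilla_tokens(n_params, tokens_per_param=20):
    """Rough compute-optimal token budget under the ~20 tokens/parameter
    heuristic commonly attributed to the Chinchilla result."""
    return n_params * tokens_per_param

# Chinchilla itself: 70B parameters -> roughly 1.4T training tokens.
budget = chinchilla_tokens(70e9)
```

Under this heuristic, Gopher-scale models (280B parameters) would call for several trillion tokens, far more than they were trained on.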
GPT-NeoX-20B: An Open-Source Autoregressive Language Model
We introduce GPT-NeoX-20B, a 20 billion parameter autoregressive language model trained on the Pile, whose weights will be made freely and openly available to the public through a permissive license.
Adam: A Method for Stochastic Optimization
This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
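The update rule described here is compact enough to sketch directly. This is a minimal NumPy rendition, not the authors' reference implementation; the moment decay rates and epsilon follow the paper's suggested defaults, while the step size is enlarged for the toy demo.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: scale the step by bias-corrected estimates of the
    gradient's first moment (m) and uncentered second moment (v)."""
    m = beta1 * m + (1 - beta1) * grad        # EMA of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2   # EMA of squared gradients
    m_hat = m / (1 - beta1 ** t)              # bias correction (t starts at 1)
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Minimize f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
x, m, v = 0.0, 0.0, 0.0
for t in range(1, 1001):
    x, m, v = adam_step(x, 2 * (x - 3), m, v, t)
# x approaches the minimizer at 3.
```

The bias correction matters early on: at t = 1, m_hat recovers the raw gradient rather than a heavily shrunken EMA.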
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
Feature Learning in Infinite-Width Neural Networks
It is shown that the standard and NTK parametrizations of a neural network do not admit infinite-width limits that can learn features, which is crucial for pretraining and transfer learning such as with BERT, and that any such infinite-width limit can be computed using the Tensor Programs technique.
Language Models are Few-Shot Learners
GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic.
PyTorch: An Imperative Style, High-Performance Deep Learning Library
This paper details the principles that drove the implementation of PyTorch and how they are reflected in its architecture, and explains how the careful and pragmatic implementation of the key components of its runtime enables them to work together to achieve compelling performance.
An Empirical Model of Large-Batch Training
It is demonstrated that a simple and easy-to-measure statistic called the gradient noise scale predicts the largest useful batch size across many domains and applications, including a number of supervised learning datasets, reinforcement learning domains, and even generative model training (autoencoders on SVHN).
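The paper's "simple" noise scale, B_simple = tr(Σ) / |G|², where Σ is the per-example gradient covariance and G the mean gradient, can be sketched directly from a matrix of per-example gradients. The function below is an illustrative estimator, not the paper's two-batch-size measurement procedure, and the synthetic gradients are assumptions for the demo.

```python
import numpy as np

def simple_noise_scale(per_example_grads):
    """Estimate B_simple = tr(Sigma) / |G|^2 from per-example gradients
    (rows = examples, columns = parameters). Large values suggest that
    increasing the batch size is still useful."""
    g = per_example_grads.mean(axis=0)                  # mean gradient G
    trace_sigma = per_example_grads.var(axis=0).sum()   # tr(Sigma)
    return trace_sigma / (g @ g)

# Synthetic per-example gradients: mean 1, std 2 in each of 5 dimensions,
# so tr(Sigma) ~ 5 * 4 and |G|^2 ~ 5, giving a noise scale near 4.
rng = np.random.default_rng(0)
grads = rng.normal(loc=1.0, scale=2.0, size=(10000, 5))
```

In practice the per-example gradients would come from the model being trained, and the estimate would be averaged over training.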
Wide Residual Networks, 2017
Pay Attention to MLPs
This work proposes a simple attention-free network architecture, gMLP, based solely on MLPs with gating, and shows that it can perform as well as Transformers in key language and vision applications and can scale as much as Transformers over increased data and compute.
Tensor Programs IIb: Architectural Universality of Neural Tangent Kernel Training Dynamics
This work shows that these neural networks, in the so-called NTK parametrization, follow kernel gradient descent dynamics in function space during training, where the kernel is the infinite-width NTK.
ResMLP: Feedforward networks for image classification with data-efficient training
ResMLP is a simple residual network that alternates a linear layer in which image patches interact, independently and identically across channels, and a two-layer feed-forward network in which channels interact independently per patch.