Stop Wasting My Time! Saving Days of ImageNet and BERT Training with Latest Weight Averaging

@article{Kaddour2022StopWM,
  title={Stop Wasting My Time! Saving Days of ImageNet and BERT Training with Latest Weight Averaging},
  author={Jean Kaddour},
  journal={ArXiv},
  year={2022},
  volume={abs/2209.14981}
}
  • Jean Kaddour
  • Published 29 September 2022
  • Computer Science
  • ArXiv
Training vision or language models on large datasets can take days, if not weeks. We show that averaging the weights of the k latest checkpoints, each collected at the end of an epoch, can speed up the training progression in terms of loss and accuracy by dozens of epochs, corresponding to time savings of up to ~68 and ~30 GPU hours when training a ResNet50 on ImageNet and a RoBERTa-Base model on WikiText-103, respectively. We also provide the code and model checkpoint trajectory to reproduce the…
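The checkpoint-averaging scheme described in the abstract can be sketched in a few lines of PyTorch. This is a minimal illustration assuming a standard training loop; the names (train_with_lawa, k, checkpoints) are illustrative and not taken from the paper's released code.

from collections import deque
import copy
import torch

def train_with_lawa(model, optimizer, loader, loss_fn, epochs, k=5):
    # Keep only the k most recent end-of-epoch checkpoints.
    checkpoints = deque(maxlen=k)
    for epoch in range(epochs):
        model.train()
        for x, y in loader:
            optimizer.zero_grad()
            loss_fn(model(x), y).backward()
            optimizer.step()
        # Snapshot the weights at the end of the epoch.
        checkpoints.append({name: t.detach().clone().cpu()
                            for name, t in model.state_dict().items()})
        # Uniform average of the k latest checkpoints, used for evaluation only;
        # training continues from `model`, not from the average.
        avg_state = {}
        for name, latest in checkpoints[-1].items():
            if latest.is_floating_point():
                avg_state[name] = torch.stack([c[name] for c in checkpoints]).mean(dim=0)
            else:
                avg_state[name] = latest  # e.g. integer BatchNorm counters
        eval_model = copy.deepcopy(model)
        eval_model.load_state_dict(avg_state)
        # ... evaluate eval_model here ...
    return model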

Citations

When Do Flat Minima Optimizers Work?

Comparing the loss surfaces of models trained with each method, and benchmarking broadly across computer vision, natural language processing, and graph representation learning tasks, this work uncovers several surprising results which it hopes will help researchers further improve deep learning optimizers and practitioners identify the right optimizer for their problem.

Recycling diverse models for out-of-distribution generalization

Foundation models are redefining how AI systems are built. Practitioners now follow a standard procedure to build their machine learning solutions: from a pre-trained foundation model, they fine-tune the weights on the target task of interest.

Weight Averaging: A Simple Yet Effective Method to Overcome Catastrophic Forgetting in Automatic Speech Recognition

A simple yet effective method to overcome catastrophic forgetting, weight averaging, is proposed: simply taking the average of the previous and the adapted model achieves high performance on both the old and new tasks.
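The averaging step itself is a one-liner over the two state dicts. A minimal sketch, assuming the previous and adapted models share the same architecture (function and variable names are illustrative):

import torch

def average_two_models(prev_state, adapted_state):
    # Mean of the floating-point weights; integer buffers (e.g. BatchNorm
    # counters) are taken from the adapted model.
    return {
        name: (prev_state[name] + adapted_state[name]) / 2
        if prev_state[name].is_floating_point()
        else adapted_state[name]
        for name in prev_state
    }

# Usage: merged.load_state_dict(average_two_models(old.state_dict(), new.state_dict()))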

References


Sharpness-Aware Minimization for Efficiently Improving Generalization

This work introduces Sharpness-Aware Minimization (SAM), a novel, effective procedure for simultaneously minimizing loss value and loss sharpness, which improves model generalization across a variety of benchmark datasets and models, yielding novel state-of-the-art performance for several of them.
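A minimal sketch of one SAM update in PyTorch, assuming a model, a base optimizer such as SGD, and a single mini-batch; the neighborhood radius rho and the helper name are illustrative:

import torch

def sam_step(model, base_optimizer, loss_fn, x, y, rho=0.05):
    base_optimizer.zero_grad()
    # 1) Gradient at the current weights.
    loss_fn(model(x), y).backward()
    params = [p for p in model.parameters() if p.grad is not None]
    with torch.no_grad():
        # 2) Ascend to the approximate worst case inside an L2 ball of radius rho.
        grad_norm = torch.norm(torch.stack([p.grad.norm(p=2) for p in params]), p=2) + 1e-12
        eps = [rho * p.grad / grad_norm for p in params]
        for p, e in zip(params, eps):
            p.add_(e)
    # 3) Gradient at the perturbed weights (the "sharpness-aware" gradient).
    base_optimizer.zero_grad()
    loss_fn(model(x), y).backward()
    with torch.no_grad():
        # 4) Undo the perturbation and take the base optimizer step.
        for p, e in zip(params, eps):
            p.sub_(e)
    base_optimizer.step()
    base_optimizer.zero_grad()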

Averaging Weights Leads to Wider Optima and Better Generalization

It is shown that simple averaging of multiple points along the trajectory of SGD, with a cyclical or constant learning rate, leads to better generalization than conventional training, and Stochastic Weight Averaging (SWA) is extremely easy to implement, improves generalization, and has almost no computational overhead.
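PyTorch ships utilities for exactly this running average (torch.optim.swa_utils); a minimal sketch, where the SGD hyperparameters, the epoch at which averaging starts, and the SWA learning rate are illustrative choices:

import torch
from torch.optim.swa_utils import AveragedModel, SWALR, update_bn

def train_with_swa(model, loader, loss_fn, epochs=100, swa_start=75):
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
    swa_model = AveragedModel(model)               # running average of the weights
    swa_scheduler = SWALR(optimizer, swa_lr=0.05)  # constant LR once averaging starts
    for epoch in range(epochs):
        for x, y in loader:
            optimizer.zero_grad()
            loss_fn(model(x), y).backward()
            optimizer.step()
        if epoch >= swa_start:
            swa_model.update_parameters(model)     # fold current weights into the average
            swa_scheduler.step()
    update_bn(loader, swa_model)  # recompute BatchNorm statistics for the averaged weights
    return swa_model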

Adam: A Method for Stochastic Optimization

This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
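The update rule itself is short; a sketch of a single Adam step for one parameter tensor, using the paper's default hyperparameters (in practice one would simply use torch.optim.Adam):

import torch

def adam_update(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # Exponential moving averages of the gradient and its square (first and
    # second moments), with bias correction for the first t steps.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (torch.sqrt(v_hat) + eps)
    return param, m, v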

Deep Residual Learning for Image Recognition

This work presents a residual learning framework to ease the training of networks that are substantially deeper than those used previously, and provides comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth.
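The core building block is small; a minimal sketch of a basic residual block, where the layers learn a residual F(x) that is added back to the identity shortcut:

from torch import nn

class BasicResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        identity = x
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + identity)  # F(x) + x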

A Fair Comparison of Two Popular Flat-Minima Optimizers: Stochastic Weight Averaging vs. Sharpness-Aware Minimization

A number of surprising results are discovered from a broad benchmarking across computer vision, natural language processing, and graph representation learning tasks that will help researchers further improve deep learning optimizers, and practitioners identify the right optimizer for their problem.

Causal Machine Learning: A Survey and Open Problems

This work categorizes work in CausalML into five groups according to the problems they address, systematically compares the methods in each category, and points out open problems.

Trainable Weight Averaging for Fast Convergence and Better Generalization

Trainable Weight Averaging (TWA), a training method that operates in a reduced subspace spanned by historical solutions, is proposed; it largely reduces the estimation error of SWA, so it not only further improves the SWA solutions but also takes full advantage of the solutions generated early in training, where SWA fails.
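A rough sketch of the underlying idea, learning mixing coefficients over historical checkpoints instead of averaging them uniformly. The softmax parameterization, hyperparameters, and names below are assumptions for illustration, not the paper's exact method; torch.func.functional_call (PyTorch 2.x) is used to keep the forward pass differentiable with respect to the coefficients:

import torch
from torch.func import functional_call

def mixed_state(checkpoints, alphas):
    # Softmax-normalized coefficients, one per historical checkpoint.
    weights = torch.softmax(alphas, dim=0)
    return {name: sum(w * c[name] for w, c in zip(weights, checkpoints))
            for name in checkpoints[0]}

def fit_mixing_coefficients(model, checkpoints, loader, loss_fn, steps=100, lr=1e-2):
    model.eval()  # reuse the checkpoints' BatchNorm statistics
    alphas = torch.nn.Parameter(torch.zeros(len(checkpoints)))
    opt = torch.optim.Adam([alphas], lr=lr)
    for _, (x, y) in zip(range(steps), loader):
        # Forward pass with the mixed weights, keeping the graph back to `alphas`.
        preds = functional_call(model, mixed_state(checkpoints, alphas), (x,))
        loss = loss_fn(preds, y)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return mixed_state(checkpoints, alphas.detach())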

Training Compute-Optimal Large Language Models

This paper trains a predicted compute-optimal model, Chinchilla, that uses the same compute budget as Gopher but with 70B parameters and 4× more data, and reaches a state-of-the-art average accuracy on the MMLU benchmark.
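As a back-of-the-envelope check (an assumption layered on top of the summary, not something stated on this page), the commonly used approximation C ≈ 6·N·D for training FLOPs with N parameters and D tokens makes the trade-off explicit: shrinking the model 4x while training on 4x more tokens leaves the compute budget unchanged.

def approx_train_flops(n_params, n_tokens):
    # Rule-of-thumb estimate of training compute: C ≈ 6 * N * D.
    return 6 * n_params * n_tokens

# Gopher-scale figures (~280B parameters, ~300B tokens), used only for illustration.
N, D = 280e9, 300e9
assert approx_train_flops(N / 4, 4 * D) == approx_train_flops(N, D)  # 70B params, 4x data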

Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time

The model soup approach extends to multiple image classification and natural language processing tasks, improves out-of-distribution performance, and improves zero-shot performance on new downstream tasks.
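A minimal sketch of the uniform variant, averaging the weights of several models fine-tuned from the same initialization (the helper name and the handling of integer buffers are illustrative):

import torch

def uniform_soup(state_dicts):
    # Average the floating-point weights across all ingredient models; copy
    # integer buffers (e.g. BatchNorm counters) from the first one.
    n = len(state_dicts)
    return {
        name: sum(sd[name] for sd in state_dicts) / n
        if state_dicts[0][name].is_floating_point()
        else state_dicts[0][name]
        for name in state_dicts[0]
    }

The greedy variant described in the paper adds ingredients one at a time and keeps each only if held-out accuracy improves.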

Stochastic Weight Averaging Revisited

It is shown that PSWA remarkably outperforms its backbone SGD during the early stage of the SGD sampling process, demonstrating the hypothesis that there are global-scale geometric structures in the DNN loss landscape which can be discovered by an SGD agent at the early stage of its working period and exploited by the weight averaging (WA) operation.
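A rough sketch of periodic weight averaging, under the assumption that end-of-epoch weights are averaged within each period and the averaged weights are loaded back into the model before the next period; the period length and restart rule are illustrative, not the paper's exact schedule:

import torch

def train_with_periodic_averaging(model, optimizer, loader, loss_fn, epochs=90, period=10):
    collected = []
    for epoch in range(epochs):
        for x, y in loader:
            optimizer.zero_grad()
            loss_fn(model(x), y).backward()
            optimizer.step()
        collected.append({n: t.detach().clone() for n, t in model.state_dict().items()})
        if (epoch + 1) % period == 0:
            # Average the period's checkpoints and restart from the average.
            avg = {n: torch.stack([c[n] for c in collected]).mean(dim=0)
                   if collected[0][n].is_floating_point() else collected[-1][n]
                   for n in collected[0]}
            model.load_state_dict(avg)
            collected.clear()
    return model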