Corpus ID: 227240564

Neural networks with late-phase weights

@article{Oswald2021NeuralNW,
  title={Neural networks with late-phase weights},
  author={Johannes von Oswald and Seijin Kobayashi and Jo{\~a}o Sacramento and Alexander Meulemans and Christian Andreas Henning and Benjamin F. Grewe},
  journal={arXiv: Learning},
  year={2021}
}
The largely successful method of training neural networks is to learn their weights using some variant of stochastic gradient descent (SGD). Here, we show that the solutions found by SGD can be further improved by ensembling a subset of the weights in late stages of learning. At the end of learning, we obtain back a single model by taking a spatial average in weight space. To avoid incurring increased computational costs, we investigate a family of low-dimensional late-phase weight models which… 
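As a rough, hypothetical sketch of the idea (not the authors' released code): maintain K copies of a small late-phase parameter subset, here assumed to be the BatchNorm affine parameters, train one randomly sampled copy per minibatch during the late phase of learning, and average the copies back into a single model at the end. The function names, training budget, and `loader` are placeholders, and the gradient handling for the shared weights is simplified relative to the paper.

```python
import random
import torch
import torch.nn as nn

def late_phase_params(model):
    # Assumption: the BatchNorm affine parameters form the late-phase subset.
    return [p for m in model.modules() if isinstance(m, nn.BatchNorm2d)
            for p in (m.weight, m.bias)]

def train_with_late_phase(model, loader, K=5, late_start=100, total_steps=200, lr=0.1):
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    live = late_phase_params(model)
    # K ensemble members, each a detached copy of the current late-phase weights.
    members = [[p.detach().clone() for p in live] for _ in range(K)]
    for step, (x, y) in enumerate(loader):
        if step >= total_steps:
            break
        if step >= late_start:
            k = random.randrange(K)
            for p, saved in zip(live, members[k]):
                p.data.copy_(saved)            # activate member k's late-phase weights
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()
        if step >= late_start:
            for p, saved in zip(live, members[k]):
                saved.copy_(p.detach())        # store member k's updated weights
    # Collapse the ensemble: average the late-phase members into a single model.
    for i, p in enumerate(live):
        p.data.copy_(torch.stack([m[i] for m in members]).mean(dim=0))
    return model
```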
Citations

Posterior Meta-Replay for Continual Learning
TLDR
This work studies principled ways to tackle the CL problem by adopting a Bayesian perspective and focuses on continually learning a task-specific posterior distribution via a shared meta-model, a task-conditioned hypernetwork, in sharp contrast to most Bayesian CL approaches, which focus on the recursive update of a single posterior distribution.
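To make the central object named here concrete, the following is a hedged sketch of a task-conditioned hypernetwork: a learned task embedding is mapped to the weights of a small target network, so that one shared meta-model produces task-specific parameters. All sizes, names, and the two-layer architecture are illustrative assumptions, not the paper's implementation.

```python
from math import prod
import torch
import torch.nn as nn

class TaskConditionedHypernet(nn.Module):
    def __init__(self, n_tasks, emb_dim, target_shapes):
        super().__init__()
        self.emb = nn.Embedding(n_tasks, emb_dim)       # one learned embedding per task
        self.target_shapes = target_shapes
        total = sum(prod(s) for s in target_shapes)
        self.net = nn.Sequential(nn.Linear(emb_dim, 128), nn.ReLU(),
                                 nn.Linear(128, total))

    def forward(self, task_id):
        # Map the task embedding to a flat parameter vector, then reshape it
        # into the weight tensors of the target network.
        flat = self.net(self.emb(torch.tensor([task_id]))).squeeze(0)
        params, i = [], 0
        for shape in self.target_shapes:
            n = prod(shape)
            params.append(flat[i:i + n].view(*shape))
            i += n
        return params

# Example: generate weight and bias for a 784 -> 10 linear target network.
# hnet = TaskConditionedHypernet(n_tasks=5, emb_dim=32, target_shapes=[(10, 784), (10,)])
# w, b = hnet(task_id=0)
# logits = torch.nn.functional.linear(x, w, b)
```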
Low-Loss Subspace Compression for Clean Gains against Multi-Agent Backdoor Attacks
TLDR
This paper contributes three defenses that yield improved multi-agent backdoor robustness, maximizing accuracy with respect to clean labels while minimizing accuracy on poisoned labels.
Prune and Tune Ensembles: Low-Cost Ensemble Learning With Sparse Independent Subnetworks
TLDR
This work introduces a fast, low-cost method for creating ensembles of neural networks without training multiple models from scratch: a single parent network is trained first, and the parameters of each child network are then dramatically pruned to create an ensemble of members with unique and diverse topologies.
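A loose sketch of the recipe described above, under the assumption that each child is created by copying the trained parent and zeroing a random subset of its weights before a brief fine-tuning pass; the paper's specific pruning and tuning strategies are not reproduced here.

```python
import copy
import torch

def make_child(parent, sparsity=0.5):
    # Clone the trained parent and prune a random subset of its weights,
    # giving each child its own sparse topology. (Assumption: unstructured
    # random pruning; the masks should be kept fixed while fine-tuning.)
    child = copy.deepcopy(parent)
    for p in child.parameters():
        mask = (torch.rand_like(p) > sparsity).float()
        p.data.mul_(mask)
    return child

def ensemble_predict(children, x):
    # Average the softmax outputs of the fine-tuned children.
    with torch.no_grad():
        probs = [torch.softmax(c(x), dim=-1) for c in children]
    return torch.stack(probs).mean(dim=0)

# children = [make_child(parent) for _ in range(8)]  # then briefly fine-tune each
```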
Efficient Self-Ensemble for Semantic Segmentation
TLDR
The self-ensemble approach takes advantage of the multi-scale feature set produced by feature pyramid network methods to feed independent decoders, creating an ensemble within a single model that can be trained end-to-end, alleviating the traditional cumbersome multi-stage training of ensembles.
On the reversed bias-variance tradeoff in deep ensembles
TLDR
It is shown that, under practical assumptions, in the overparametrized regime far into the double descent curve, not only does the ensemble test loss degrade, but common out-of-distribution detection and calibration metrics suffer as well, suggesting deep ensembles can benefit from early stopping.
Deep Ensembling with No Overhead for either Training or Testing: The All-Round Blessings of Dynamic Sparsity
TLDR
This work draws a unique connection between sparse neural network training and deep ensembles, yielding a novel efficient ensemble learning framework called FreeTickets, which surpasses the dense baseline in all of the following criteria: prediction accuracy, uncertainty estimation, out-of-distribution (OoD) robustness, and efficiency for both training and inference.
Reinforcement Learning, Bit by Bit
TLDR
To illustrate concepts, simple agents that build on them are designed, and computational results that highlight data efficiency are presented.
Learning Neural Network Subspaces
TLDR
This work uses the subspace midpoint to boost accuracy, calibration, and robustness to label noise, outperforming Stochastic Weight Averaging and approaching the ensemble performance of independently trained networks without the training cost.
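As a hedged illustration of the subspace idea (here a one-dimensional line through weight space for a tiny linear classifier; the regularizer that keeps the endpoints apart is omitted): train a random point on the segment at every step, then deploy the midpoint.

```python
import torch
import torch.nn.functional as F

# Two endpoints of a line segment in weight space (illustrative sizes: 784 -> 10).
w1 = (0.01 * torch.randn(10, 784)).requires_grad_()
w2 = (0.01 * torch.randn(10, 784)).requires_grad_()
b1 = torch.zeros(10, requires_grad=True)
b2 = torch.zeros(10, requires_grad=True)
opt = torch.optim.SGD([w1, w2, b1, b2], lr=0.1)

def predict(x, alpha):
    # Evaluate the model located at position alpha along the segment.
    w = (1 - alpha) * w1 + alpha * w2
    b = (1 - alpha) * b1 + alpha * b2
    return F.linear(x, w, b)

# Training: sample a random alpha per minibatch so the whole segment stays low-loss.
# for x, y in loader:
#     loss = F.cross_entropy(predict(x, torch.rand(()).item()), y)
#     opt.zero_grad(); loss.backward(); opt.step()

# Inference: the subspace midpoint (alpha = 0.5) is the single model to deploy.
```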

References

Showing 1-10 of 99 references
BatchEnsemble: An Alternative Approach to Efficient Ensemble and Lifelong Learning
TLDR
BatchEnsemble is proposed, an ensemble method whose computational and memory costs are significantly lower than those of typical ensembles and that can easily scale up to lifelong learning on Split-ImageNet, which involves 100 sequential learning tasks.
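The cost savings come from parameterizing each ensemble member as a shared weight matrix modulated by member-specific rank-1 factors; the sketch below shows this for a linear layer, with one member index per call for simplicity (in practice the minibatch is split across members).

```python
import torch
import torch.nn as nn

class BatchEnsembleLinear(nn.Module):
    def __init__(self, in_features, out_features, ensemble_size):
        super().__init__()
        # One shared weight matrix plus small per-member rank-1 factors r_k, s_k.
        self.weight = nn.Parameter(0.01 * torch.randn(out_features, in_features))
        self.r = nn.Parameter(torch.ones(ensemble_size, in_features))
        self.s = nn.Parameter(torch.ones(ensemble_size, out_features))
        self.bias = nn.Parameter(torch.zeros(ensemble_size, out_features))

    def forward(self, x, k):
        # Member k's effective weight is W * outer(s_k, r_k), computed implicitly
        # by scaling the inputs and outputs instead of materializing K matrices.
        return (x * self.r[k]) @ self.weight.t() * self.s[k] + self.bias[k]

# layer = BatchEnsembleLinear(128, 10, ensemble_size=4)
# logits = layer(torch.randn(32, 128), k=2)
```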
Averaging Weights Leads to Wider Optima and Better Generalization
TLDR
It is shown that simple averaging of multiple points along the trajectory of SGD, with a cyclical or constant learning rate, leads to better generalization than conventional training, and Stochastic Weight Averaging (SWA) is extremely easy to implement, improves generalization, and has almost no computational overhead.
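A minimal sketch of the averaging step, assuming a plain SGD loop with a constant learning rate and a placeholder `loader`; PyTorch also ships ready-made helpers in torch.optim.swa_utils (AveragedModel, SWALR, update_bn).

```python
import copy
import torch

def train_with_swa(model, loader, optimizer, swa_start, total_steps, swa_freq=50):
    # Keep an equal-weight running average of the weights visited by SGD
    # after step swa_start, captured every swa_freq steps.
    loss_fn = torch.nn.CrossEntropyLoss()
    swa_model, n_avg = None, 0
    for step, (x, y) in enumerate(loader):
        if step >= total_steps:
            break
        optimizer.zero_grad()
        loss_fn(model(x), y).backward()
        optimizer.step()
        if step >= swa_start and step % swa_freq == 0:
            if swa_model is None:
                swa_model = copy.deepcopy(model)
            else:
                for p_avg, p in zip(swa_model.parameters(), model.parameters()):
                    p_avg.data.add_(p.data - p_avg.data, alpha=1.0 / (n_avg + 1))
            n_avg += 1
    return swa_model  # BatchNorm statistics should then be recomputed on training data
```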
Adam: A Method for Stochastic Optimization
TLDR
This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
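For reference, a single Adam update written out in NumPy: exponential moving averages of the gradient and its square are bias-corrected and used to scale the step (default hyperparameters from the paper).

```python
import numpy as np

def adam_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    # m, v: first and second moment estimates; t: 1-based step counter.
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)          # bias correction
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# w, m, v = np.zeros(3), np.zeros(3), np.zeros(3)
# for t in range(1, 101):
#     g = compute_gradient(w)          # placeholder for a stochastic gradient
#     w, m, v = adam_step(w, g, m, v, t)
```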
Lookahead Optimizer: k steps forward, 1 step back
TLDR
Lookahead improves the learning stability and lowers the variance of its inner optimizer with negligible computation and memory cost, and can significantly improve the performance of SGD and Adam, even with their default hyperparameter settings.
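A hedged sketch of the update rule: run the inner ("fast") optimizer for k steps, then move the "slow" weights a fraction alpha toward the fast weights and reset the fast weights to the slow ones. The loop structure and `loader` are placeholders.

```python
import torch

def train_with_lookahead(model, loader, inner_opt, k=5, alpha=0.5, total_steps=1000):
    loss_fn = torch.nn.CrossEntropyLoss()
    slow = [p.detach().clone() for p in model.parameters()]   # slow weights
    for step, (x, y) in enumerate(loader):
        if step >= total_steps:
            break
        inner_opt.zero_grad()
        loss_fn(model(x), y).backward()
        inner_opt.step()                                      # one fast step
        if (step + 1) % k == 0:
            for s, p in zip(slow, model.parameters()):
                s.add_(p.detach() - s, alpha=alpha)           # slow += alpha * (fast - slow)
                p.data.copy_(s)                               # fast <- slow
    return model
```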
A Simple Baseline for Bayesian Uncertainty in Deep Learning
TLDR
It is demonstrated that SWAG performs well on a wide variety of tasks, including out of sample detection, calibration, and transfer learning, in comparison to many popular alternatives including MC dropout, KFAC Laplace, SGLD, and temperature scaling.
Learning Implicitly Recurrent CNNs Through Parameter Sharing
TLDR
A parameter sharing scheme is proposed in which different layers of a convolutional neural network (CNN) are defined by a learned linear combination of parameter tensors from a global bank of templates, yielding a flexible hybridization of traditional CNNs and recurrent networks.
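A hedged sketch of such a scheme for linear layers: a global bank of template tensors is shared across layers, and each layer only learns a small coefficient vector that mixes the templates into its weight matrix. Sizes and names are illustrative.

```python
import torch
import torch.nn as nn

class TemplateBankLinear(nn.Module):
    # Soft parameter sharing: this layer's weight is a learned linear
    # combination of templates drawn from a shared global bank.
    def __init__(self, bank, out_features):
        super().__init__()
        self.bank = bank                                   # (n_templates, out, in), shared
        self.coeffs = nn.Parameter(0.1 * torch.randn(bank.shape[0]))
        self.bias = nn.Parameter(torch.zeros(out_features))

    def forward(self, x):
        weight = torch.einsum('t,toi->oi', self.coeffs, self.bank)
        return x @ weight.t() + self.bias

# bank = nn.Parameter(0.01 * torch.randn(8, 64, 64))      # shared across layers
# layers = [TemplateBankLinear(bank, 64) for _ in range(4)]
```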
Wide Residual Networks
TLDR
This paper conducts a detailed experimental study of the architecture of ResNet blocks and proposes a novel architecture in which the depth of residual networks is decreased and their width increased; the resulting network structures, called wide residual networks (WRNs), are far superior to their commonly used thin and very deep counterparts.
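A hedged sketch of the building block (a pre-activation residual block whose channel count is multiplied by a widening factor); dropout between the convolutions is omitted here.

```python
import torch.nn as nn

class WideBasicBlock(nn.Module):
    # WRN-style pre-activation residual block: BN -> ReLU -> Conv, twice,
    # with a 1x1 projection shortcut when the shape changes.
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.bn1 = nn.BatchNorm2d(in_ch)
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, stride=1, padding=1, bias=False)
        self.relu = nn.ReLU(inplace=True)
        self.shortcut = (nn.Identity() if stride == 1 and in_ch == out_ch
                         else nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False))

    def forward(self, x):
        out = self.conv1(self.relu(self.bn1(x)))
        out = self.conv2(self.relu(self.bn2(out)))
        return out + self.shortcut(x)

# e.g. WRN-28-10 stacks 4 such blocks per group with widths 160, 320, 640.
```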
Which Algorithmic Choices Matter at Which Batch Sizes? Insights From a Noisy Quadratic Model
TLDR
It is experimentally demonstrated that optimization algorithms that employ preconditioning, specifically Adam and K-FAC, result in much larger critical batch sizes than stochastic gradient descent with momentum.
Second-order Optimization for Neural Networks
Fast Context Adaptation via Meta-Learning
TLDR
It is shown empirically that CAVIA outperforms MAML on regression, classification, and reinforcement learning problems, is easier to implement, and is more robust to the inner-loop learning rate.
...