Corpus ID: 211132990

BatchEnsemble: An Alternative Approach to Efficient Ensemble and Lifelong Learning

@article{Wen2020BatchEnsembleAA,
  title={BatchEnsemble: An Alternative Approach to Efficient Ensemble and Lifelong Learning},
  author={Yeming Wen and Dustin Tran and Jimmy Ba},
  journal={ArXiv},
  year={2020},
  volume={abs/2002.06715}
}
Ensembles, where multiple neural networks are trained individually and their predictions are averaged, have been shown to be widely successful for improving both the accuracy and predictive uncertainty of single neural networks. However, an ensemble's cost for both training and testing increases linearly with the number of networks, which quickly becomes untenable. In this paper, we propose BatchEnsemble, an ensemble method whose computational and memory costs are significantly lower than… 
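The abstract is truncated here. BatchEnsemble is commonly summarized as forming each ensemble member's weight matrix as the element-wise product of a single shared matrix with a rank-one per-member factor, so that all members can be evaluated in one vectorized pass. The sketch below illustrates that composition under those stated assumptions; shapes and variable names are illustrative, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)

d_in, d_out, n_members, batch = 8, 4, 3, 5

# Shared "slow" weight matrix, plus per-member rank-one "fast" weights.
W = rng.normal(size=(d_in, d_out))
r = rng.normal(size=(n_members, d_in))   # input-side vectors r_i
s = rng.normal(size=(n_members, d_out))  # output-side vectors s_i

x = rng.normal(size=(batch, d_in))

# Explicit per-member forward pass: W_i = W * outer(r_i, s_i).
explicit = np.stack([x @ (W * np.outer(r[i], s[i])) for i in range(n_members)])

# Vectorized equivalent: scale inputs by r_i, use the shared W, scale outputs by s_i.
vectorized = ((x[None, :, :] * r[:, None, :]) @ W) * s[:, None, :]

assert np.allclose(explicit, vectorized)
print(vectorized.shape)  # (n_members, batch, d_out); average over members for the ensemble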
Citations

Deep Ensembling with No Overhead for either Training or Testing: The All-Round Blessings of Dynamic Sparsity
TLDR: This work draws a unique connection between sparse neural network training and deep ensembles, yielding a novel efficient ensemble learning framework called FreeTickets, which surpasses the dense baseline on all of the following criteria: prediction accuracy, uncertainty estimation, out-of-distribution (OoD) robustness, and efficiency for both training and inference.
Deep Ensembles Work, But Are They Necessary?
TLDR: While deep ensembles are a practical way to achieve performance improvement, the results show that they may be a tool of convenience rather than a fundamentally better model class.
Economical ensembles with hypernetworks
TLDR: The proposed method generates an ensemble by randomly initializing a number of additional weight embeddings in the vicinity of each other, and exploits the inherent randomness in stochastic gradient descent to induce ensemble diversity.
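As a rough illustration of the idea described in the entry above, the sketch below uses a linear map as a stand-in hypernetwork that turns each of several weight embeddings, initialized close to one another, into a layer's weights; member predictions are then averaged. The hypernetwork form, sizes, and names are assumptions for illustration, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(1)

emb_dim, d_in, d_out, n_members = 4, 6, 3, 5

# Hypernetwork: a single linear map from an embedding to a flattened layer weight.
H = rng.normal(scale=0.1, size=(emb_dim, d_in * d_out))

# Ensemble member embeddings initialized in a small neighborhood of a shared point.
base = rng.normal(size=emb_dim)
embeddings = base + 0.01 * rng.normal(size=(n_members, emb_dim))

x = rng.normal(size=(2, d_in))

member_logits = []
for e in embeddings:
    W = (e @ H).reshape(d_in, d_out)   # member-specific weights generated on the fly
    member_logits.append(x @ W)

ensemble_logits = np.mean(member_logits, axis=0)  # average member predictions
print(ensemble_logits.shape)  # (2, d_out)
```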
Diversity and Generalization in Neural Network Ensembles
TLDR: This work combines and expands previously published results in a theoretically sound framework that describes the relationship between diversity and ensemble performance for a wide range of ensemble methods, and empirically validates this theoretical analysis with neural network ensembles.
FreeTickets: Accurate, Robust and Efficient Deep Ensemble by Training with Dynamic Sparsity
TLDR: This work introduces the FreeTickets concept as the first solution that can boost the performance of sparse convolutional neural networks over their dense equivalents by a large margin, while using only a fraction of the computational resources required by the latter for the complete training.
Deep Ensembles for Low-Data Transfer Learning
TLDR: This work shows that the nature of pre-training itself is a performant source of diversity, and proposes a practical algorithm that efficiently identifies a subset of pre-trained models for any downstream dataset and achieves state-of-the-art performance at a much lower inference budget.
Multi-headed Neural Ensemble Search
TLDR: This work extends Neural Ensemble Search to multi-headed ensembles, which consist of a shared backbone attached to multiple prediction heads, enabling one-shot NAS methods to optimize an ensemble objective.
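A toy sketch of the multi-headed ensemble structure mentioned in the entry above: a shared backbone is computed once and several independent heads produce predictions that are averaged. The single-layer backbone, sizes, and names are illustrative, and the neural-architecture-search component is omitted.

```python
import numpy as np

rng = np.random.default_rng(2)
d_in, d_hidden, n_classes, n_heads = 10, 16, 3, 4

# Shared backbone (one hidden layer) and several independent prediction heads.
W_backbone = rng.normal(scale=0.1, size=(d_in, d_hidden))
heads = [rng.normal(scale=0.1, size=(d_hidden, n_classes)) for _ in range(n_heads)]

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

x = rng.normal(size=(5, d_in))
h = np.maximum(x @ W_backbone, 0.0)                          # backbone features, computed once
probs = np.mean([softmax(h @ Wh) for Wh in heads], axis=0)   # average the head predictions
print(probs.shape)  # (5, n_classes)
```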
Towards a robust experimental framework and benchmark for lifelong language learning
TLDR: An experimental framework for true, general lifelong learning in NLP is proposed, and shortcomings of prevalent memory-based solutions are revealed, demonstrating that they are unable to outperform a simple experience replay baseline under the realistic lifelong learning setup.
Diversity Matters When Learning From Ensembles
TLDR: This work proposes a perturbation strategy for distillation that reveals diversity by seeking inputs on which ensemble member outputs disagree, and empirically shows that a model distilled with such perturbed samples indeed exhibits enhanced diversity, leading to improved performance.
Sparse MoEs meet Efficient Ensembles
TLDR: Partitioned batch ensembles, an efficient ensemble of sparse MoEs that takes the best of both classes of models, is presented; it not only scales to models with up to 2.7B parameters, but also provides larger performance gains for larger models.

References

SHOWING 1-10 OF 73 REFERENCES
Snapshot Ensembles: Train 1, get M for free
TLDR: This paper proposes a method to achieve the seemingly contradictory goal of ensembling multiple neural networks at no additional training cost, by training a single neural network that converges to several local minima along its optimization path and saving the model parameters at each.
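A minimal sketch of the general snapshot-ensembling recipe described above, assuming a cyclically annealed learning rate with a parameter snapshot saved at the end of each cycle; the cosine schedule, step counts, and placeholder update are illustrative rather than the paper's exact settings.

```python
import math

def snapshot_lr(step, steps_per_cycle, lr_max=0.1):
    """Cyclic cosine-annealed learning rate; it returns to lr_max at each cycle start."""
    t = (step % steps_per_cycle) / steps_per_cycle
    return 0.5 * lr_max * (1.0 + math.cos(math.pi * t))

snapshots = []
total_steps, steps_per_cycle = 3000, 1000
params = {"w": 0.0}  # stand-in for real model parameters

for step in range(total_steps):
    lr = snapshot_lr(step, steps_per_cycle)
    params["w"] -= lr * 0.0  # placeholder gradient step
    if (step + 1) % steps_per_cycle == 0:
        snapshots.append(dict(params))  # save a snapshot at each cycle's low-LR end

print(len(snapshots))  # 3 snapshots -> average their predictions at test time
```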
Distilling the Knowledge in a Neural Network
TLDR: This work shows that the acoustic model of a heavily used commercial system can be significantly improved by distilling the knowledge in an ensemble of models into a single model, and introduces a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse.
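A minimal sketch of a standard distillation objective consistent with the entry above: softened teacher probabilities at a temperature T are matched by the student alongside the usual hard-label loss. The temperature, the weighting `alpha`, and the T-squared scaling follow common practice and are assumptions here, not a specific codebase.

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Weighted sum of soft-target cross-entropy (at temperature T) and hard-label cross-entropy."""
    p_teacher = softmax(teacher_logits, T)
    log_p_student_T = np.log(softmax(student_logits, T) + 1e-12)
    soft = -np.mean(np.sum(p_teacher * log_p_student_T, axis=-1)) * (T ** 2)
    log_p_student = np.log(softmax(student_logits) + 1e-12)
    hard = -np.mean(log_p_student[np.arange(len(labels)), labels])
    return alpha * soft + (1 - alpha) * hard

rng = np.random.default_rng(3)
s, t = rng.normal(size=(4, 5)), rng.normal(size=(4, 5))
print(distillation_loss(s, t, labels=np.array([0, 2, 1, 4])))
```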
Temporal Ensembling for Semi-Supervised Learning
TLDR: Self-ensembling is introduced: the ensemble prediction formed from the network's outputs at different training epochs can be expected to be a better predictor for the unknown labels than the output of the network at the most recent training epoch, and can thus be used as a target for training.
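A minimal sketch of a temporal-ensembling style target update, assuming the commonly described form: an exponential moving average of per-example predictions with a startup bias correction. The decay `alpha` and the stand-in predictions are illustrative.

```python
import numpy as np

def update_temporal_ensemble(Z, current_preds, epoch, alpha=0.6):
    """Accumulate an EMA of per-example predictions and return bias-corrected targets."""
    Z = alpha * Z + (1.0 - alpha) * current_preds
    targets = Z / (1.0 - alpha ** (epoch + 1))  # correct the startup bias of the EMA
    return Z, targets

rng = np.random.default_rng(4)
n_examples, n_classes = 6, 3
Z = np.zeros((n_examples, n_classes))
for epoch in range(5):
    preds = rng.dirichlet(np.ones(n_classes), size=n_examples)  # stand-in network outputs
    Z, targets = update_temporal_ensemble(Z, preds, epoch)
print(targets[0])
```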
Efficient Lifelong Learning with A-GEM
TLDR: An improved version of GEM is proposed, dubbed Averaged GEM (A-GEM), which enjoys performance equal to or better than GEM while being almost as computationally and memory efficient as EWC and other regularization-based methods.
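A minimal sketch of an A-GEM style gradient projection, under the usual description: if the current-task gradient conflicts with the average gradient on an episodic-memory batch, the conflicting component is projected out. The toy vectors are illustrative.

```python
import numpy as np

def agem_project(grad, grad_ref):
    """Project the current-task gradient if it conflicts with the episodic-memory gradient."""
    dot = grad @ grad_ref
    if dot >= 0.0:
        return grad  # no interference with past tasks; keep the gradient as-is
    return grad - (dot / (grad_ref @ grad_ref)) * grad_ref

g = np.array([1.0, -2.0, 0.5])
g_ref = np.array([0.5, 1.0, 0.0])   # gradient on a batch sampled from episodic memory
g_tilde = agem_project(g, g_ref)
print(g_tilde, g_tilde @ g_ref)      # projected gradient no longer points against g_ref
```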
Lifelong Learning with Dynamically Expandable Networks
TLDR: The obtained network, fine-tuned on all tasks, achieves significantly better performance than the batch models, which shows that it can be used to estimate the optimal network structure even when all tasks are available in the first place.
Flipout: Efficient Pseudo-Independent Weight Perturbations on Mini-Batches
Stochastic neural net weights are used in a variety of contexts, including regularization, Bayesian neural nets, exploration in reinforcement learning, and evolution strategies. Unfortunately, due to…
Attention is All you Need
TLDR: A new simple network architecture, the Transformer, based solely on attention mechanisms and dispensing with recurrence and convolutions entirely, is proposed; it generalizes well to other tasks, as shown by applying it successfully to English constituency parsing with both large and limited training data.
Overcoming catastrophic forgetting in neural networks
TLDR: It is shown that it is possible to overcome this limitation of connectionist models and train networks that can maintain expertise on tasks they have not experienced for a long time, by selectively slowing down learning on the weights important for previous tasks.
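A minimal sketch of an EWC-style quadratic penalty consistent with the entry above: learning is slowed on parameters with high estimated (diagonal Fisher) importance for previous tasks. The regularization strength `lam`, the Fisher values, and the stand-in task loss are illustrative.

```python
import numpy as np

def ewc_penalty(params, old_params, fisher_diag, lam=1.0):
    """Quadratic penalty that slows learning on weights deemed important for previous tasks."""
    return 0.5 * lam * np.sum(fisher_diag * (params - old_params) ** 2)

theta = np.array([0.9, -1.2, 0.3])
theta_old = np.array([1.0, -1.0, 0.0])   # parameters after the previous task
fisher = np.array([5.0, 0.1, 0.01])      # diagonal Fisher information estimate
task_loss = 0.42                         # stand-in for the current-task loss
total_loss = task_loss + ewc_penalty(theta, theta_old, fisher, lam=10.0)
print(total_loss)
```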
K For The Price Of 1: Parameter Efficient Multi-task And Transfer Learning
TLDR: A novel method is proposed that enables parameter-efficient transfer and multi-task learning with deep neural networks, and it is shown that re-learning existing low-parameter layers while keeping the rest of the network frozen also improves transfer-learning accuracy significantly.
Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
TLDR: Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin.
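A minimal sketch of the training-time batch-normalization transform: each feature is normalized over the mini-batch and then given a learned scale and shift. The toy data are illustrative; in practice the batch statistics also feed running averages used at inference.

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the mini-batch, then apply a learned scale and shift."""
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta, mu, var  # mu/var would also update running statistics

rng = np.random.default_rng(5)
x = rng.normal(loc=3.0, scale=2.0, size=(32, 4))   # mini-batch of 32 examples, 4 features
y, mu, var = batch_norm_train(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=0).round(6), y.std(axis=0).round(3))  # ~0 mean, ~1 std per feature
```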