Corpus ID: 239998136

Diversity Matters When Learning From Ensembles

@article{Nam2021DiversityMW,
  title={Diversity Matters When Learning From Ensembles},
  author={Giung Nam and Jongmin Yoon and Yoonho Lee and Juho Lee},
  journal={ArXiv},
  year={2021},
  volume={abs/2110.14149}
}
Deep ensembles excel in large-scale image classification tasks both in terms of prediction accuracy and calibration. Despite being simple to train, the computation and memory cost of deep ensembles limits their practicability. While some recent works propose to distill an ensemble model into a single model to reduce such costs, there is still a performance gap between the ensemble and distilled models. We propose a simple approach for reducing this gap, i.e., making the distilled performance… 
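
For concreteness, here is a minimal sketch of the kind of ensemble-distillation objective the abstract refers to, assuming PyTorch; the `student`, `teachers`, and `distillation_loss` names are illustrative, not the paper's implementation. The student is fit to the averaged softmax of the teacher ensemble with a KL term.

```python
# Minimal ensemble-distillation sketch (assumes PyTorch; model names are hypothetical).
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits_list):
    """KL divergence between the student's distribution and the averaged
    softmax of the teacher ensemble."""
    with torch.no_grad():
        teacher_probs = torch.stack(
            [F.softmax(t, dim=-1) for t in teacher_logits_list]
        ).mean(dim=0)  # average the ensemble's predictive distribution
    log_student = F.log_softmax(student_logits, dim=-1)
    return F.kl_div(log_student, teacher_probs, reduction="batchmean")

# Usage (hypothetical models):
# loss = distillation_loss(student(x), [t(x) for t in teachers])
# loss.backward()
```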

Improving Ensemble Distillation With Weight Averaging and Diversifying Perturbation

A weight averaging technique is proposed in which a student with multiple subnetworks is trained to absorb the functional diversity of ensemble teachers; the subnetworks are then properly averaged for inference, yielding a single student network with no additional inference cost.
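
A rough sketch of the weight-averaging step described above, assuming PyTorch and architecturally identical subnetworks; `average_weights` and the plain parameter mean are illustrative stand-ins, not the paper's exact averaging procedure.

```python
# Sketch: average several subnetworks' parameters into one inference-time network
# (assumes PyTorch; `subnets` are architecturally identical, hypothetical models).
import copy
import torch

def average_weights(subnets):
    ref = subnets[0].state_dict()
    avg_state = {}
    for name, tensor in ref.items():
        stacked = torch.stack([s.state_dict()[name].float() for s in subnets])
        avg_state[name] = stacked.mean(dim=0).to(tensor.dtype)
    merged = copy.deepcopy(subnets[0])
    merged.load_state_dict(avg_state)
    return merged  # a single network, so inference costs no more than one subnetwork
```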

Functional Ensemble Distillation

This work investigates how best to distill an ensemble's predictions into an efficient model, proposes a novel and general distillation approach named Functional Ensemble Distillation (FED), and finds that learning the distilled model via a simple mixup augmentation scheme significantly boosts performance.
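
A minimal mixup sketch, assuming PyTorch and NumPy, to illustrate the augmentation scheme mentioned above; how the mixed inputs enter the distillation objective is omitted, and the function name is hypothetical.

```python
# Minimal mixup sketch for distillation-time augmentation (assumes PyTorch and NumPy).
import numpy as np
import torch

def mixup(x, alpha=0.2):
    lam = np.random.beta(alpha, alpha)   # mixing coefficient sampled from Beta(alpha, alpha)
    perm = torch.randperm(x.size(0))     # random pairing within the batch
    mixed = lam * x + (1.0 - lam) * x[perm]
    return mixed, perm, lam

# The mixed inputs can be fed to both the teacher ensemble and the student, so the
# student is matched to the teachers on an augmented input distribution.
```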

Verification-Aided Deep Ensemble Selection

This case study harnesses recent advances in DNN verification to devise a methodology for identifying ensemble compositions that are less prone to simultaneous errors, even when the input is adversarially perturbed, resulting in more robustly accurate ensemble-based classification.

Sparse MoEs meet Efficient Ensembles

Partitioned batch ensembles are presented, an efficient ensemble of sparse MoEs that takes the best of both classes of models, not only scaling to models with up to 2.7B parameters but also providing larger performance gains for larger models.

Deep Isolation Forest for Anomaly Detection

A new representation scheme is proposed that utilises casually initialised neural networks to map the original data into random representation ensembles, to which random axis-parallel cuts are subsequently applied to perform the data partition, encouraging a unique synergy between random representations and random partition-based isolation.
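
The following is a rough illustration of that general idea, assuming NumPy and scikit-learn: data are pushed through untrained, randomly initialised networks and an isolation forest is fit on each random representation. It is a sketch of the concept, not the paper's exact algorithm, and all names are hypothetical.

```python
# Sketch: isolation over random neural representations (assumes NumPy and scikit-learn).
import numpy as np
from sklearn.ensemble import IsolationForest

def deep_isolation_scores(X, n_reps=5, hidden=64, seed=0):
    rng = np.random.default_rng(seed)
    scores = np.zeros(X.shape[0])
    for _ in range(n_reps):
        # a randomly initialised, never-trained two-layer ReLU map
        W1 = rng.standard_normal((X.shape[1], hidden))
        W2 = rng.standard_normal((hidden, hidden))
        Z = np.maximum(0, np.maximum(0, X @ W1) @ W2)
        forest = IsolationForest(random_state=0).fit(Z)
        scores += -forest.score_samples(Z)   # higher value = more anomalous
    return scores / n_reps
```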

New Classification Method for Independent Data Sources Using Pawlak Conflict Model and Decision Trees

The paper shows that the proposed approach provides a significant improvement in classification quality and execution time, and that it reduces computational complexity because fewer classifiers need to be built.

Unsupervised Machine Learning for Explainable Medicare Fraud Detection

Novel machine learning tools are developed to identify providers that overbill Medicare, the US federal health insurance program for elderly adults and the disabled, using large-scale Medicare claims data to identify patterns consistent with fraud or overbilling among inpatient hospitalizations.

References


Deep Ensembles: A Loss Landscape Perspective

Developing the concept of the diversity-accuracy plane, this work shows that the decorrelation power of random initializations is unmatched by popular subspace sampling methods, and its experimental results validate the hypothesis that deep ensembles work well under dataset shift.

Hydra: Preserving Ensemble Diversity for Model Distillation

This work proposes a distillation method based on a single multi-headed neural network that improves distillation performance on classification and regression settings while capturing the uncertainty behaviour of the original ensemble over both in-domain and out-of-distribution tasks.

Ensemble Distribution Distillation

A solution for EnD², building on a class of models that allow a single neural network to explicitly model a distribution over output distributions, is proposed in this work; it enables a single model to retain both the improved classification performance of ensemble distillation and information about the diversity of the ensemble, which is useful for uncertainty estimation.
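
As a sketch of the distribution-distillation idea, assuming PyTorch: the student predicts Dirichlet concentration parameters and is trained to maximise the likelihood of each ensemble member's softmax output under that Dirichlet. The names are hypothetical and details of the original training recipe (e.g., any temperature annealing) are omitted.

```python
# Sketch of an Ensemble Distribution Distillation style loss (assumes PyTorch).
import torch
import torch.nn.functional as F

def end2_loss(student_logits, teacher_logits_list, eps=1e-6):
    """Student outputs Dirichlet concentrations; maximise the likelihood of the
    teachers' softmax outputs under that Dirichlet."""
    alpha = F.softplus(student_logits) + eps                 # positive concentrations
    dirichlet = torch.distributions.Dirichlet(alpha)
    nll = 0.0
    for t in teacher_logits_list:
        probs = F.softmax(t, dim=-1).clamp_min(eps)
        probs = probs / probs.sum(dim=-1, keepdim=True)      # keep values on the simplex
        nll = nll - dirichlet.log_prob(probs).mean()
    return nll / len(teacher_logits_list)
```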

Distilling Ensembles Improves Uncertainty Estimates

This work obtains negative theoretical results on the possibility of approximating deep ensemble weights by batch ensemble weights, and so turns to distillation.

Knowledge Distillation Thrives on Data Augmentation

This paper shows that KD loss can benefit from extended training iterations while the cross-entropy loss does not, and proposes to enhance KD via a stronger data augmentation scheme (e.g., mixup, CutMix), which may inspire more advanced KD algorithms.

Distilling the Knowledge in a Neural Network

This work shows that it can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model and introduces a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse.
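
A sketch of the classic temperature-scaled distillation loss this summary alludes to, assuming PyTorch; the temperature and mixing weight are placeholder values, not recommendations.

```python
# Sketch of temperature-scaled knowledge distillation (assumes PyTorch).
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=4.0, weight=0.9):
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                       # rescale so gradients match the hard-label term
    hard = F.cross_entropy(student_logits, labels)
    return weight * soft + (1.0 - weight) * hard
```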

Bayesian Deep Learning and a Probabilistic Perspective of Generalization

It is shown that deep ensembles provide an effective mechanism for approximate Bayesian marginalization, and a related approach is proposed that further improves the predictive distribution by marginalizing within basins of attraction, without significant overhead.

Training independent subnetworks for robust prediction

This work shows that, using a multi-input multi-output (MIMO) configuration, one can utilize a single model's capacity to train multiple subnetworks that independently learn the task at hand, and improves model robustness without increasing compute.
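
A rough sketch of a MIMO-style wrapper, assuming PyTorch: several inputs are concatenated along the channel dimension and the network emits one prediction per subnetwork. The class and argument names are hypothetical, and the backbone is assumed to accept the widened input and return a flat feature vector.

```python
# Rough MIMO sketch (assumes PyTorch): m inputs in, m predictions out of one backbone.
import torch
import torch.nn as nn

class MIMOWrapper(nn.Module):
    def __init__(self, backbone, feat_dim, num_classes, m=3):
        super().__init__()
        self.m = m
        self.num_classes = num_classes
        self.backbone = backbone                       # assumed to take m * C input channels
        self.head = nn.Linear(feat_dim, m * num_classes)

    def forward(self, xs):                             # xs: list of m tensors (B, C, H, W)
        feats = self.backbone(torch.cat(xs, dim=1))    # (B, feat_dim)
        logits = self.head(feats)                      # (B, m * num_classes)
        return logits.view(-1, self.m, self.num_classes)

# Training: each of the m outputs gets its own independent example and label;
# at test time the same image is repeated m times and the m predictions are averaged.
```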

Wide Residual Networks

This paper conducts a detailed experimental study on the architecture of ResNet blocks and proposes a novel architecture in which the depth of residual networks is decreased and their width increased; the resulting network structures, called wide residual networks (WRNs), are far superior to their commonly used thin and very deep counterparts.
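
For illustration, a minimal wide residual block in PyTorch with the pre-activation layout; channel widths are chosen by the caller, and this is a sketch rather than the exact WRN reference implementation.

```python
# Minimal wide residual block sketch (assumes PyTorch; pre-activation layout).
import torch.nn as nn

class WideBasicBlock(nn.Module):
    def __init__(self, in_ch, out_ch, stride=1, dropout=0.3):
        super().__init__()
        self.bn1 = nn.BatchNorm2d(in_ch)
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.drop = nn.Dropout(dropout)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, stride=1, padding=1, bias=False)
        self.relu = nn.ReLU(inplace=True)
        self.shortcut = (
            nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False)
            if stride != 1 or in_ch != out_ch else nn.Identity()
        )

    def forward(self, x):
        out = self.conv1(self.relu(self.bn1(x)))
        out = self.conv2(self.drop(self.relu(self.bn2(out))))
        return out + self.shortcut(x)
```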

Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles

This work proposes an alternative to Bayesian NNs that is simple to implement, readily parallelizable, requires very little hyperparameter tuning, and yields high quality predictive uncertainty estimates.
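
A minimal deep-ensemble prediction sketch, assuming PyTorch and a list of independently trained `models` (same architecture, different random seeds); predictions are averaged and the entropy of the mean serves as a simple uncertainty score.

```python
# Deep-ensemble prediction sketch (assumes PyTorch; `models` are hypothetical,
# independently trained copies of the same architecture).
import torch
import torch.nn.functional as F

@torch.no_grad()
def ensemble_predict(models, x):
    probs = torch.stack([F.softmax(m(x), dim=-1) for m in models]).mean(dim=0)
    # predictive entropy of the averaged distribution as a simple uncertainty measure
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)
    return probs, entropy
```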