# Collegial Ensembles

@article{Littwin2020CollegialE, title={Collegial Ensembles}, author={Etai Littwin and Ben Myara and Sima Sabah and Joshua M. Susskind and Shuangfei Zhai and Oren Golan}, journal={ArXiv}, year={2020}, volume={abs/2006.07678} }

Modern neural network performance typically improves as model size increases. A recent line of research on the Neural Tangent Kernel (NTK) of over-parameterized networks indicates that this improvement with size is the product of a better-conditioned loss landscape. In this work, we investigate a form of over-parameterization achieved through ensembling, where we define collegial ensembles (CE) as the aggregation of multiple independent models with identical architectures, trained as a…
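
The paper's core construct, aggregating multiple independently initialized models with identical architectures, can be sketched as follows. This is a minimal illustration only, not the authors' implementation; the two-layer MLP, layer sizes, and output-averaging rule are assumptions made for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

def init_mlp(d_in, d_hidden, d_out):
    """One ensemble member: a small two-layer ReLU MLP with NTK-style scaling."""
    return {
        "W1": rng.normal(0, 1 / np.sqrt(d_in), (d_hidden, d_in)),
        "W2": rng.normal(0, 1 / np.sqrt(d_hidden), (d_out, d_hidden)),
    }

def forward(params, x):
    # ReLU hidden layer followed by a linear readout.
    return params["W2"] @ np.maximum(params["W1"] @ x, 0.0)

def collegial_ensemble(members, x):
    """Aggregate m identical architectures by averaging their outputs."""
    return np.mean([forward(p, x) for p in members], axis=0)

members = [init_mlp(8, 32, 3) for _ in range(4)]  # m = 4 independent models
x = rng.normal(size=8)
y = collegial_ensemble(members, x)
print(y.shape)  # (3,)
```

Each member sees the same input and has the same shape; only the random initialization differs, which is what makes the aggregate behave like a wider (more over-parameterized) model.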

## 3 Citations

### On the reversed bias-variance tradeoff in deep ensembles

- Computer Science
- ICML 2021

It is shown that, under practical assumptions, in the over-parameterized regime far into the double-descent curve, not only does the ensemble test loss degrade, but common out-of-distribution detection and calibration metrics suffer as well, suggesting that deep ensembles can benefit from early stopping.

### Representation mitosis in wide neural networks

- Computer Science
- ArXiv
- 2021

It is shown that a key ingredient for activating mitosis is continuing the training process until the training error reaches zero, and that in one of the learning tasks, a wide model with several automatically developed clones performs significantly better than a deep ensemble of architectures whose last layer matches the clones in size.

### Deep Gaussian Denoiser Epistemic Uncertainty and Decoupled Dual-Attention Fusion

- Computer Science
- 2021 IEEE International Conference on Image Processing (ICIP)

This work proposes a model-agnostic approach for reducing epistemic uncertainty using only a single pretrained network, along with an ensemble method with two decoupled attention paths, one over the pixel domain and one over the different manipulations, to learn the final fusion.

## References

Showing 1–10 of 25 references

### SGD Learns Over-parameterized Networks that Provably Generalize on Linearly Separable Data

- Computer Science
- ICLR
- 2018

This work proves convergence rates of SGD to a global minimum and provides generalization guarantees for this global minimum that are independent of the network size, and shows that SGD can avoid overfitting despite the high capacity of the model.

### Residual Tangent Kernels

- Computer Science
- ArXiv
- 2020

This work derives the form of the limiting kernel for architectures incorporating bypass connections, namely residual networks (ResNets) as well as densely connected networks (DenseNets), and shows that in ResNets, convergence to the NTK may occur as depth and width simultaneously tend to infinity, provided proper initialization.

### Understanding deep learning requires rethinking generalization

- Computer Science
- ICLR
- 2017

These experiments establish that state-of-the-art convolutional networks for image classification trained with stochastic gradient methods easily fit a random labeling of the training data, and confirm that simple depth-two neural networks already have perfect finite-sample expressivity.

### Aggregated Residual Transformations for Deep Neural Networks

- Computer Science
- 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

On the ImageNet-1K dataset, it is empirically shown that, even under the restricted condition of maintained complexity, increasing cardinality improves classification accuracy, and that doing so is more effective than going deeper or wider when capacity is increased.
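
The aggregated-transformation idea behind cardinality can be illustrated with a toy dense version. This is a sketch only: real ResNeXt blocks use grouped convolutions rather than the fully connected branches below, and the dimensions here are arbitrary assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

def make_branch(d, d_bottleneck):
    """One low-dimensional transformation T_i: project down, ReLU, project up."""
    return {
        "down": rng.normal(0, 1 / np.sqrt(d), (d_bottleneck, d)),
        "up": rng.normal(0, 1 / np.sqrt(d_bottleneck), (d, d_bottleneck)),
    }

def aggregated_block(branches, x):
    """ResNeXt-style block: y = x + sum_i T_i(x); cardinality = len(branches)."""
    return x + sum(b["up"] @ np.maximum(b["down"] @ x, 0.0) for b in branches)

cardinality = 8                      # number of parallel branches
branches = [make_branch(16, 4) for _ in range(cardinality)]
x = rng.normal(size=16)
y = aggregated_block(branches, x)
print(y.shape)  # (16,)
```

Raising the cardinality adds parallel low-dimensional paths at roughly constant parameter count (each branch is narrow), which is the knob the paper contrasts with going deeper or wider.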

### Wide Residual Networks

- Computer Science
- BMVC
- 2016

This paper conducts a detailed experimental study on the architecture of ResNet blocks and proposes a novel architecture in which the depth of residual networks is decreased and their width increased; the resulting network structures, called wide residual networks (WRNs), are far superior to their commonly used thin and very deep counterparts.

### On Exact Computation with an Infinitely Wide Neural Net

- Computer Science
- NeurIPS
- 2019

The current paper gives the first efficient exact algorithm for computing the extension of the NTK to convolutional neural nets, called the Convolutional NTK (CNTK), along with an efficient GPU implementation of this algorithm.
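
For intuition about what these kernel methods compute: the empirical NTK between two inputs is the inner product of parameter gradients, Θ(x, x') = ⟨∇θ f(x), ∇θ f(x')⟩. The sketch below evaluates it with hand-derived gradients for a one-hidden-layer ReLU network; it is not the paper's CNTK algorithm, and the width, scaling, and initialization are assumptions made for the example:

```python
import numpy as np

rng = np.random.default_rng(2)
d, m = 5, 512                    # input dimension, hidden width
W = rng.normal(size=(m, d))      # hidden-layer weights
v = rng.normal(size=m)           # output weights

def grads(x):
    """Gradient of f(x) = v . relu(Wx) / sqrt(m) w.r.t. all parameters, flattened."""
    pre = W @ x
    act = np.maximum(pre, 0.0)
    g_v = act / np.sqrt(m)                                       # df/dv
    g_W = ((v * (pre > 0)) / np.sqrt(m))[:, None] * x[None, :]   # df/dW
    return np.concatenate([g_v, g_W.ravel()])

def empirical_ntk(x1, x2):
    """Theta(x1, x2) = <grad f(x1), grad f(x2)>."""
    return grads(x1) @ grads(x2)

x1, x2 = rng.normal(size=d), rng.normal(size=d)
k11, k12, k21 = empirical_ntk(x1, x1), empirical_ntk(x1, x2), empirical_ntk(x2, x1)
print(k12 == k21, k11 >= 0)  # True True
```

As the width m grows, this random-initialization kernel concentrates around its deterministic infinite-width limit, which is the object the NTK and CNTK papers analyze.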

### How to Start Training: The Effect of Initialization and Architecture

- Computer Science
- NeurIPS
- 2018

This work identifies two common failure modes for early training, in which the mean and variance of activations are poorly behaved, and gives rigorous proofs of when each failure mode occurs at initialization and how to avoid it.

### Tensor Programs I: Wide Feedforward or Recurrent Neural Networks of Any Architecture are Gaussian Processes

- Computer Science
- NeurIPS
- 2019

This work introduces a language for expressing neural network computations, and shows that the Neural Network–Gaussian Process correspondence surprisingly extends to all modern feedforward or recurrent neural networks composed of multilayer perceptrons, RNNs, and/or layer normalization.

### On the Convex Behavior of Deep Neural Networks in Relation to the Layers' Width

- Computer Science
- ArXiv
- 2020

It is observed that during crucial parts of the training process, the Hessian in wide networks is dominated by the component G, and that, when initialized using common methodologies, the gradients of over-parameterized networks are approximately orthogonal to H, so that the curvature of the loss surface is strictly positive in the direction of the gradient.

### Enhanced Convolutional Neural Tangent Kernels

- Computer Science
- ArXiv
- 2019

The resulting kernel, a CNN-GP with LAP and horizontal-flip data augmentation, achieves 89% accuracy, matching the performance of AlexNet; this is the best such result the authors know of for a classifier that is not a trained neural network.