Corpus ID: 219687408

Collegial Ensembles

@article{Littwin2020CollegialE,
  title={Collegial Ensembles},
  author={Etai Littwin and Ben Myara and Sima Sabah and Joshua M. Susskind and Shuangfei Zhai and Oren Golan},
  journal={ArXiv},
  year={2020},
  volume={abs/2006.07678}
}
Modern neural network performance typically improves as model size increases. A recent line of research on the Neural Tangent Kernel (NTK) of over-parameterized networks indicates that the improvement with size increase is a product of a better conditioned loss landscape. In this work, we investigate a form of over-parameterization achieved through ensembling, where we define collegial ensembles (CE) as the aggregation of multiple independent models with identical architectures, trained as a… 
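
The abstract defines a collegial ensemble (CE) as an aggregation of multiple independent models with identical architectures, trained jointly. A minimal PyTorch sketch of that kind of aggregation, assuming a small MLP sub-model and mean aggregation of the member outputs (both the sub-model and the aggregation rule are illustrative choices, not the paper's exact construction):

import torch
import torch.nn as nn

class SubModel(nn.Module):
    # One ensemble member; every member shares this architecture.
    def __init__(self, in_dim, hidden, out_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, x):
        return self.net(x)

class CollegialEnsemble(nn.Module):
    # Aggregates m independently parameterized copies of the same architecture.
    # The copies are trained jointly: one loss is computed on the aggregated
    # output, so gradients reach every member.
    def __init__(self, m, in_dim, hidden, out_dim):
        super().__init__()
        self.members = nn.ModuleList(
            SubModel(in_dim, hidden, out_dim) for _ in range(m)
        )

    def forward(self, x):
        outs = torch.stack([member(x) for member in self.members], dim=0)
        return outs.mean(dim=0)  # mean aggregation (illustrative choice)

# Joint training step on the aggregated output.
model = CollegialEnsemble(m=4, in_dim=32, hidden=128, out_dim=10)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(64, 32), torch.randint(0, 10, (64,))
loss = nn.functional.cross_entropy(model(x), y)
loss.backward()
opt.step()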

On the reversed bias-variance tradeoff in deep ensembles

It is shown that under practical assumptions in the overparametrized regime far into the double descent curve, not only does the ensemble test loss degrade, but common out-of-distribution detection and calibration metrics suffer as well, suggesting that deep ensembles can benefit from early stopping.

Representation mitosis in wide neural networks

It is shown that a key ingredient for activating mitosis is continuing training until the training error reaches zero, and that in one of the learning tasks a wide model with several automatically developed clones performs significantly better than a deep ensemble built from architectures whose last layer has the same size as the clones.

Deep Gaussian Denoiser Epistemic Uncertainty and Decoupled Dual-Attention Fusion

This work proposes a model-agnostic approach for reducing epistemic uncertainty using only a single pretrained network, together with an ensemble method that learns the final fusion through two decoupled attention paths, one over the pixel domain and one over the different manipulations.

References

Showing 1-10 of 25 references

SGD Learns Over-parameterized Networks that Provably Generalize on Linearly Separable Data

This work proves convergence rates of SGD to a global minimum and provides generalization guarantees for this global minimum that are independent of the network size, and shows that SGD can avoid overfitting despite the high capacity of the model.

Residual Tangent Kernels

This work derives the form of the limiting kernel for architectures incorporating bypass connections, namely residual networks (ResNets) as well as densely connected networks (DenseNets), and shows that in ResNets, convergence to the NTK may occur when depth and width simultaneously tend to infinity, provided proper initialization.

Understanding deep learning requires rethinking generalization

These experiments establish that state-of-the-art convolutional networks for image classification trained with stochastic gradient methods easily fit a random labeling of the training data, and confirm that simple depth-two neural networks already have perfect finite-sample expressivity.
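
The random-labeling experiment summarized above is straightforward to reproduce in miniature. A small sketch, assuming synthetic Gaussian inputs and a two-layer ReLU network in place of the paper's image benchmarks; an over-parameterized model typically drives training accuracy on the shuffled labels to 100%:

import torch
import torch.nn as nn

torch.manual_seed(0)

# Synthetic data whose labels are assigned uniformly at random, so there is
# no true input-label relationship to learn.
n, d, num_classes = 512, 64, 10
x = torch.randn(n, d)
y = torch.randint(0, num_classes, (n,))

# Over-parameterized two-layer network (far more parameters than samples).
model = nn.Sequential(nn.Linear(d, 4096), nn.ReLU(), nn.Linear(4096, num_classes))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(2000):
    opt.zero_grad()
    loss = nn.functional.cross_entropy(model(x), y)
    loss.backward()
    opt.step()

train_acc = (model(x).argmax(dim=1) == y).float().mean().item()
print(f"train accuracy on random labels: {train_acc:.3f}")  # typically reaches 1.000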

Aggregated Residual Transformations for Deep Neural Networks

On the ImageNet-1K dataset, it is empirically shown that, even under the restricted condition of maintaining complexity, increasing cardinality improves classification accuracy and is more effective than going deeper or wider when capacity is increased.
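
Cardinality here is the number of parallel transformation branches aggregated in each block, which grouped convolutions implement compactly. A minimal sketch of a ResNeXt-style bottleneck block in PyTorch; the channel sizes are illustrative, and the groups argument of nn.Conv2d sets the cardinality:

import torch
import torch.nn as nn

class ResNeXtBlock(nn.Module):
    # Bottleneck block with aggregated transformations: `cardinality` parallel
    # paths are realized by a grouped 3x3 convolution, so increasing cardinality
    # at fixed width keeps complexity roughly constant.
    def __init__(self, channels=256, bottleneck=128, cardinality=32):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, bottleneck, kernel_size=1, bias=False),
            nn.BatchNorm2d(bottleneck),
            nn.ReLU(inplace=True),
            nn.Conv2d(bottleneck, bottleneck, kernel_size=3, padding=1,
                      groups=cardinality, bias=False),
            nn.BatchNorm2d(bottleneck),
            nn.ReLU(inplace=True),
            nn.Conv2d(bottleneck, channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(x + self.body(x))  # residual (identity) connection

block = ResNeXtBlock()
print(block(torch.randn(1, 256, 14, 14)).shape)  # torch.Size([1, 256, 14, 14])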

Wide Residual Networks

This paper conducts a detailed experimental study on the architecture of ResNet blocks and proposes a novel architecture in which the depth of residual networks is decreased and their width increased; the resulting network structures, called wide residual networks (WRNs), are far superior to their commonly used thin and very deep counterparts.
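
A minimal sketch of a WRN-style basic block, where a widen factor k scales the channel count; the specific widths, dropout rate, and pre-activation layout here are illustrative assumptions rather than a faithful reimplementation of the paper:

import torch
import torch.nn as nn

class WideBasicBlock(nn.Module):
    # Pre-activation basic block whose channel count is scaled by a widen
    # factor k. A WRN-style network stacks fewer of these blocks than a thin,
    # very deep ResNet, but makes each one k times wider.
    def __init__(self, in_channels, base_channels, k=8, dropout=0.3):
        super().__init__()
        width = base_channels * k
        self.body = nn.Sequential(
            nn.BatchNorm2d(in_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels, width, kernel_size=3, padding=1, bias=False),
            nn.Dropout(p=dropout),
            nn.BatchNorm2d(width),
            nn.ReLU(inplace=True),
            nn.Conv2d(width, width, kernel_size=3, padding=1, bias=False),
        )
        # 1x1 projection when the residual path changes the channel count.
        self.shortcut = (nn.Identity() if in_channels == width
                         else nn.Conv2d(in_channels, width, kernel_size=1, bias=False))

    def forward(self, x):
        return self.shortcut(x) + self.body(x)

block = WideBasicBlock(in_channels=16, base_channels=16, k=8)
print(block(torch.randn(1, 16, 32, 32)).shape)  # torch.Size([1, 128, 32, 32])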

On Exact Computation with an Infinitely Wide Neural Net

This paper gives the first efficient exact algorithm for computing the extension of the NTK to convolutional neural nets, called the Convolutional NTK (CNTK), as well as an efficient GPU implementation of this algorithm.
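
The kernel in question is the infinite-width limit of the finite-width (empirical) NTK, Θ(x, x') = ∇_θ f(x; θ) · ∇_θ f(x'; θ). A minimal sketch of the finite-width kernel for a scalar-output network; this is not the paper's CNTK algorithm, which evaluates the infinite-width limit in closed form:

import torch
import torch.nn as nn

def empirical_ntk(model, xs):
    # Gram matrix of per-example parameter gradients of a scalar-output model.
    grads = []
    for x in xs:
        out = model(x.unsqueeze(0)).squeeze()
        g = torch.autograd.grad(out, list(model.parameters()))
        grads.append(torch.cat([p.reshape(-1) for p in g]))
    J = torch.stack(grads)   # (n, num_params) Jacobian of outputs w.r.t. parameters
    return J @ J.T           # (n, n) kernel matrix

model = nn.Sequential(nn.Linear(8, 512), nn.ReLU(), nn.Linear(512, 1))
xs = torch.randn(5, 8)
print(empirical_ntk(model, xs))  # 5x5 symmetric positive semi-definite matrix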

How to Start Training: The Effect of Initialization and Architecture

This work identifies two common failure modes for early training in which the mean and variance of activations are poorly behaved, and gives a rigorous proof of when each failure mode occurs at initialization and how to avoid it.
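
Both failure modes show up in how the mean and variance of activations evolve with depth at a random initialization. A small NumPy sketch that tracks the per-layer activation variance of a deep ReLU MLP under two Gaussian weight scales (widths, depth, and scales are illustrative):

import numpy as np

rng = np.random.default_rng(0)

def activation_variances(depth, width, scale, n=256):
    # Variance of ReLU activations after each layer of a randomly initialized MLP.
    h = rng.standard_normal((n, width))
    variances = []
    for _ in range(depth):
        W = scale * rng.standard_normal((width, width)) / np.sqrt(width)
        h = np.maximum(W @ h.T, 0.0).T   # linear layer followed by ReLU
        variances.append(h.var())
    return variances

# scale=1.0: the activation variance decays with depth (signal collapses toward zero).
# scale=sqrt(2) (He-style scaling for ReLU): the variance stays roughly constant.
for scale in (1.0, np.sqrt(2.0)):
    v = activation_variances(depth=30, width=512, scale=scale)
    print(f"scale={scale:.3f}  var@layer1={v[0]:.3f}  var@layer30={v[-1]:.2e}")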

Tensor Programs I: Wide Feedforward or Recurrent Neural Networks of Any Architecture are Gaussian Processes

This work introduces a language for expressing neural network computations and shows that the Neural Network-Gaussian Process correspondence surprisingly extends to all modern feedforward or recurrent neural networks composed of multilayer perceptrons, RNNs, and/or layer normalization.
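
For the special case of a plain ReLU MLP, the Neural Network-Gaussian Process correspondence has a closed-form kernel recursion (the arc-cosine kernel); the work above generalizes far beyond this case. A small NumPy sketch of the MLP recursion, with weight variance sw2 and bias variance sb2 as illustrative hyperparameters:

import numpy as np

def nngp_kernel(X, depth, sw2=2.0, sb2=0.0):
    # NNGP kernel of an infinitely wide ReLU MLP, via the layerwise recursion
    # K^{l+1}(x, x') = sw2 * E[relu(u) relu(v)] + sb2, where (u, v) is Gaussian
    # with covariance K^l; the ReLU expectation has an arc-cosine closed form.
    K = sw2 * (X @ X.T) / X.shape[1] + sb2   # first-layer kernel
    for _ in range(depth - 1):
        diag = np.sqrt(np.diag(K))
        corr = np.clip(K / np.outer(diag, diag), -1.0, 1.0)
        theta = np.arccos(corr)
        # E[relu(u) relu(v)] = (|u||v| / (2 pi)) * (sin(theta) + (pi - theta) cos(theta))
        K = sw2 * np.outer(diag, diag) * (
            np.sin(theta) + (np.pi - theta) * np.cos(theta)
        ) / (2 * np.pi) + sb2
    return K

X = np.random.default_rng(0).standard_normal((4, 16))
print(nngp_kernel(X, depth=3))   # 4x4 covariance matrix of the induced Gaussian process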

On the Convex Behavior of Deep Neural Networks in Relation to the Layers' Width

It seems that during crucial parts of the training process the Hessian in wide networks is dominated by the component G, and that, when networks are initialized using common methodologies, the gradients of over-parameterized networks are approximately orthogonal to the second component H, so that the curvature of the loss surface is strictly positive in the direction of the gradient.

Enhanced Convolutional Neural Tangent Kernels

The resulting kernel, CNN-GP with LAP and horizontal flip data augmentation, achieves 89% accuracy on CIFAR-10, matching the performance of AlexNet; this is the best such result the authors know of for a classifier that is not a trained neural network.