• Corpus ID: 202750018

Wider Networks Learn Better Features

Dar Gilboa, Guy Gur-Ari
Transferability of learned features between tasks can massively reduce the cost of training a neural network on a novel task. We investigate the effect of network width on learned features using activation atlases --- a visualization technique that captures features the entire hidden state responds to, as opposed to individual neurons alone. We find that, while individual neurons do not learn interpretable features in wide networks, groups of neurons do. In addition, the hidden state of a wide… 

Thick-Net: Parallel Network Structure for Sequential Modeling
Thick-Net is a simple new model that expands the network along another dimension, thickness; it efficiently avoids overfitting and is easier to optimize than the vanilla structures due to the large dropout affiliated with it.
Feature Learning in Infinite-Width Neural Networks
It is shown that the standard and NTK parametrizations of a neural network do not admit infinite-width limits that can learn features, which is crucial for pretraining and transfer learning such as with BERT, and any such infinite-width limit can be computed using the Tensor Programs technique.
Tensor Programs IV: Feature Learning in Infinite-Width Neural Networks
On Word2Vec and few-shot learning on Omniglot via MAML, two canonical tasks that rely crucially on feature learning, explicit formulas for infinite-width limits are derived exactly and are found to outperform both NTK baselines and finite-width networks.
Distributional Generalization: A New Kind of Generalization
We introduce a new notion of generalization -- Distributional Generalization -- which roughly states that outputs of a classifier at train and test time are close *as distributions*, as opposed to
Analyzing Effect on Residual Learning by Gradual Narrowing Fully-Connected Layer Width and Implementing Inception Block in Convolution Layer
Results show that the ResNet50 architecture achieves improved accuracy and a lower error rate when gradually narrowing FC layers are placed between the core residual learning schema and the output layer, and that these performance improvements are achieved without regularization.


Intriguing properties of neural networks
It is found that there is no distinction between individual high-level units and random linear combinations of high-level units, according to various methods of unit analysis, and it is suggested that it is the space, rather than the individual units, that contains the semantic information in the high layers of neural networks.
Synthesizing the preferred inputs for neurons in neural networks via deep generator networks
This work dramatically improves the qualitative state of the art of activation maximization by harnessing a powerful, learned prior: a deep generator network (DGN), which generates qualitatively state-of-the-art synthetic images that look almost real.
Learning Transferable Features with Deep Adaptation Networks
A new Deep Adaptation Network (DAN) architecture is proposed, which generalizes deep convolutional neural network to the domain adaptation scenario and can learn transferable features with statistical guarantees, and can scale linearly by unbiased estimate of kernel embedding.
Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks
We propose an algorithm for meta-learning that is model-agnostic, in the sense that it is compatible with any model trained with gradient descent and applicable to a variety of different learning
Learning and Generalization in Overparameterized Neural Networks, Going Beyond Two Layers
It is proved that overparameterized neural networks can learn some notable concept classes, including two and three-layer networks with fewer parameters and smooth activations, and that the learning can be done by SGD (stochastic gradient descent) or its variants in polynomial time using polynomially many samples.
Visualizing Higher-Layer Features of a Deep Network
This paper contrasts and compares several techniques applied to Stacked Denoising Autoencoders and Deep Belief Networks, trained on several vision datasets, and shows that good qualitative interpretations of high-level features represented by such models are possible at the unit level.
Learning Overparameterized Neural Networks via Stochastic Gradient Descent on Structured Data
It is proved that SGD learns a network with a small generalization error, even though the network has enough capacity to fit arbitrary labels, when the data comes from mixtures of well-separated distributions.
Opening the Black Box of Deep Neural Networks via Information
This work demonstrates the effectiveness of the Information-Plane visualization of DNNs and shows that the training time is dramatically reduced when adding more hidden layers, and the main advantage of the hidden layers is computational.
Scaling description of generalization with number of parameters in deep learning
This work relies on the so-called Neural Tangent Kernel, which connects large neural nets to kernel methods, to show that the initialization causes finite-size random fluctuations of the neural net output function f_N around its expectation, which affects the generalization error for classification.
Deep neural networks are easily fooled: High confidence predictions for unrecognizable images
This work takes convolutional neural networks trained to perform well on either the ImageNet or MNIST datasets and finds images with evolutionary algorithms or gradient ascent that DNNs label with high confidence as belonging to each dataset class, and produces fooling images, which are then used to raise questions about the generality of DNN computer vision.