• Corpus ID: 202750018

# Wider Networks Learn Better Features

```bibtex
@article{Gilboa2019WiderNL,
  title={Wider Networks Learn Better Features},
  author={Dar Gilboa and Guy Gur-Ari},
  journal={ArXiv},
  year={2019},
  volume={abs/1909.11572}
}
```
• Published 25 September 2019
• Computer Science
• ArXiv
Transferability of learned features between tasks can massively reduce the cost of training a neural network on a novel task. We investigate the effect of network width on learned features using activation atlases --- a visualization technique that captures features the entire hidden state responds to, as opposed to individual neurons alone. We find that, while individual neurons do not learn interpretable features in wide networks, groups of neurons do. In addition, the hidden state of a wide…
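The activation-atlas idea the abstract refers to can be sketched in a few lines: collect hidden states over many inputs, project them to 2D, bin the projection into a grid, and average the hidden states that land in each cell. The sketch below is illustrative only -- the paper's atlases use a UMAP layout and render each cell with feature visualization, whereas here a PCA projection and raw averaging stand in for both:

```python
import numpy as np

def activation_atlas(hidden_states, grid_size=4):
    """Toy activation atlas: 2D-project hidden states, bin them into a
    grid, and average the hidden states that land in each cell."""
    X = hidden_states - hidden_states.mean(axis=0)
    # PCA via SVD: the top-2 principal components give the 2D layout
    # (the paper uses UMAP here instead).
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    coords = X @ Vt[:2].T                                  # (n, 2)
    # Map coordinates into [0, grid_size) and assign grid cells.
    lo, hi = coords.min(axis=0), coords.max(axis=0)
    cells = ((coords - lo) / (hi - lo + 1e-9) * grid_size).astype(int)
    atlas = np.zeros((grid_size, grid_size, hidden_states.shape[1]))
    counts = np.zeros((grid_size, grid_size))
    for (i, j), h in zip(cells, hidden_states):
        atlas[i, j] += h
        counts[i, j] += 1
    nonempty = counts > 0
    atlas[nonempty] /= counts[nonempty, None]              # cell means
    return atlas, counts

rng = np.random.default_rng(0)
H = rng.normal(size=(200, 32))       # stand-in for real hidden states
atlas, counts = activation_atlas(H)  # atlas.shape == (4, 4, 32)
```

Each non-empty cell of `atlas` then summarizes what a whole region of the hidden state responds to, which is the group-of-neurons view the paper argues for.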
## 5 Citations


Thick-Net: Parallel Network Structure for Sequential Modeling
• Computer Science
ArXiv
• 2019
Thick-Net is a simple new model that expands the network along another dimension, thickness; it can efficiently avoid overfitting and is easier to optimize than vanilla structures due to the large dropout affiliated with it.
Feature Learning in Infinite-Width Neural Networks
• Computer Science
ArXiv
• 2020
It is shown that the standard and NTK parametrizations of a neural network do not admit infinite-width limits that can learn features, which is crucial for pretraining and transfer learning such as with BERT, and that any such infinite-width limit can be computed using the Tensor Programs technique.
Tensor Programs IV: Feature Learning in Infinite-Width Neural Networks
• Computer Science
ICML
• 2021
On Word2Vec and few-shot learning on Omniglot via MAML, two canonical tasks that rely crucially on feature learning, explicit formulas for infinite-width limits are derived exactly and are found to outperform both NTK baselines and finite-width networks.
Distributional Generalization: A New Kind of Generalization
• Computer Science, Mathematics
ArXiv
• 2020
We introduce a new notion of generalization -- Distributional Generalization -- which roughly states that outputs of a classifier at train and test time are close *as distributions*, as opposed to…
Analyzing Effect on Residual Learning by Gradual Narrowing Fully-Connected Layer Width and Implementing Inception Block in Convolution Layer
Results show that the ResNet50 architecture achieves improved accuracy and a reduced error rate when gradually narrowing FC layers are employed between the core residual learning schema and the output layer; these performance improvements were achieved without regularization.

## References

Showing 1–10 of 34 references
Intriguing properties of neural networks
• Computer Science
ICLR
• 2014
It is found that there is no distinction between individual high-level units and random linear combinations of high-level units according to various methods of unit analysis, suggesting that it is the space, rather than the individual units, that contains the semantic information in the high layers of neural networks.
Synthesizing the preferred inputs for neurons in neural networks via deep generator networks
• Computer Science
NIPS
• 2016
This work dramatically improves the qualitative state of the art of activation maximization by harnessing a powerful, learned prior: a deep generator network (DGN), which generates qualitatively state-of-the-art synthetic images that look almost real.
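For contrast with the DGN prior above, vanilla activation maximization is just gradient ascent on the input. The toy sketch below maximizes one unit of a random linear layer under a norm constraint (the layer and all names are illustrative, not from the paper), which makes the optimum known in closed form:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))   # toy "layer": 8 units, 16 input dims

def maximize_unit(W, unit, steps=200, lr=0.1):
    """Gradient ascent on the input to maximize one (linear) unit's
    activation, keeping the input inside the unit L2 ball."""
    x = rng.normal(size=W.shape[1]) * 0.01
    for _ in range(steps):
        x += lr * W[unit]                   # d(W x)[unit] / dx = W[unit]
        x /= max(np.linalg.norm(x), 1.0)    # project back onto the ball
    return x

x_star = maximize_unit(W, unit=3)
# For a linear unit the optimum is just W[3] / ||W[3]||, so the result
# is easy to sanity-check; real feature visualization replaces the
# linear unit with a deep net and adds regularizers or a learned prior.
```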
Learning Transferable Features with Deep Adaptation Networks
• Computer Science
ICML
• 2015
A new Deep Adaptation Network (DAN) architecture is proposed, which generalizes deep convolutional neural networks to the domain adaptation scenario, can learn transferable features with statistical guarantees, and scales linearly via an unbiased estimate of the kernel embedding.
Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks
• Computer Science
ICML
• 2017
We propose an algorithm for meta-learning that is model-agnostic, in the sense that it is compatible with any model trained with gradient descent and applicable to a variety of different learning problems.
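The meta-learning loop described here can be sketched in first-order form (FOMAML, which drops MAML's second-order term) on a toy family of 1D linear-regression tasks y = a·x; everything below is an illustrative simplification, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def task_grad(w, a, x):
    """d/dw of the task loss mean((w*x - a*x)**2) for task slope a."""
    return 2.0 * np.mean(x * (w * x - a * x))

def fomaml(slopes, inner_lr=0.01, meta_lr=0.05, meta_steps=500):
    w = 0.0                                    # meta-initialization
    for _ in range(meta_steps):
        meta_grad = 0.0
        for a in slopes:                       # a batch of tasks
            x_s = rng.uniform(-1, 1, size=20)  # support set
            w_adapted = w - inner_lr * task_grad(w, a, x_s)  # inner step
            x_q = rng.uniform(-1, 1, size=20)  # query set
            # First-order approximation: evaluate the gradient at the
            # adapted weights without differentiating through the inner
            # update (full MAML would backprop through it).
            meta_grad += task_grad(w_adapted, a, x_q)
        w -= meta_lr * meta_grad / len(slopes)
    return w

w_meta = fomaml(slopes=[1.0, 2.0, 3.0])
```

With these linear tasks the meta-initialization converges toward the mean task slope, which is the point that adapts best after one inner gradient step.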
Learning and Generalization in Overparameterized Neural Networks, Going Beyond Two Layers
• Computer Science
NeurIPS
• 2019
It is proved that overparameterized neural networks can learn some notable concept classes, including two- and three-layer networks with fewer parameters and smooth activations, using SGD (stochastic gradient descent) or its variants, in polynomial time and with polynomially many samples.
Visualizing Higher-Layer Features of a Deep Network
• Computer Science
• 2009
This paper contrasts and compares several techniques applied to Stacked Denoising Autoencoders and Deep Belief Networks, trained on several vision datasets, and shows that good qualitative interpretations of the high-level features represented by such models are possible at the unit level.
Learning Overparameterized Neural Networks via Stochastic Gradient Descent on Structured Data
• Computer Science
NeurIPS
• 2018
It is proved that SGD learns a network with a small generalization error, even though the network has enough capacity to fit arbitrary labels, when the data comes from mixtures of well-separated distributions.
Opening the Black Box of Deep Neural Networks via Information
• Computer Science
ArXiv
• 2017
This work demonstrates the effectiveness of the Information-Plane visualization of DNNs, shows that the training time is dramatically reduced when adding more hidden layers, and argues that the main advantage of the hidden layers is computational.
Deep neural networks are easily fooled: High confidence predictions for unrecognizable images
• Computer Science
2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
• 2015
This work takes convolutional neural networks trained to perform well on either the ImageNet or MNIST datasets and uses evolutionary algorithms or gradient ascent to find images that the DNNs label with high confidence as belonging to each dataset class; the resulting fooling images raise questions about the generality of DNN computer vision.
A Convergence Theory for Deep Learning via Over-Parameterization
• Computer Science
ICML
• 2019
This work proves why stochastic gradient descent can find global minima on the training objective of DNNs in $\textit{polynomial time}$ and implies an equivalence between over-parameterized neural networks and neural tangent kernel (NTK) in the finite (and polynomial) width setting.
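The neural tangent kernel mentioned here is just the Gram matrix of parameter gradients, K(x, x') = ∇_θ f(x)·∇_θ f(x'). A minimal empirical version for a toy two-layer ReLU network (widths, names, and scaling below are illustrative choices, not from this reference) can be computed directly:

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 256, 5                    # hidden width, input dimension
W = rng.normal(size=(m, d))      # first-layer weights
v = rng.normal(size=m)           # second-layer weights

def param_grad(x):
    """Gradient of f(x) = v . relu(W x) / sqrt(m) w.r.t. all parameters."""
    pre = W @ x
    act = np.maximum(pre, 0.0)
    d_v = act / np.sqrt(m)                          # df/dv
    d_W = np.outer(v * (pre > 0), x) / np.sqrt(m)   # df/dW (ReLU mask)
    return np.concatenate([d_W.ravel(), d_v])

X = rng.normal(size=(8, d))                  # a small batch of inputs
G = np.stack([param_grad(x) for x in X])     # one gradient row per input
K = G @ G.T                                  # empirical NTK Gram matrix
# K is symmetric positive semi-definite by construction.
```

In the infinite-width limit this matrix stays (nearly) fixed during training, which is what makes the NTK equivalence in the cited result possible.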