Corpus ID: 225075982

Are wider nets better given the same number of parameters?

Anna Golubeva, Behnam Neyshabur, Guy Gur-Ari
Empirical studies demonstrate that the performance of neural networks improves with an increasing number of parameters. In most of these studies, the number of parameters is increased by increasing the network width. This raises the question: is the observed improvement due to the larger number of parameters, or to the larger width itself? We compare different ways of increasing model width while keeping the number of parameters constant. We show that for models initialized with a random… 

Over-Parameterization and Generalization in Audio Classification

This study investigates the relationship between the over-parameterization of acoustic scene classification models and their resulting generalization abilities, and indicates that increasing width improves generalization to unseen devices, even without an increase in the number of parameters.

Understanding the effect of sparsity on neural networks robustness

The hypothesis is shown to hold that, up to a certain sparsity level achieved by increasing network width and depth while keeping capacity fixed, sparsified networks consistently match and often outperform their initially dense counterparts.

The Final Ascent: When Bigger Models Generalize Worse on Noisy-Labeled Data

This work shows that under a sufficiently large noise-to-sample size ratio, generalization error eventually increases with model size, and empirically observes that the adverse effect of network size is more pronounced when robust training methods are employed to learn from noisy-labeled data.

PHEW : Constructing Sparse Networks that Learn Fast and Generalize Well without Training Data

This work shows that even though Synflow-L2 is optimal in terms of convergence, for a given network density, it results in sub-networks with “bottleneck” (narrow) layers – leading to poor performance as compared to other data-agnostic methods that use the same number of parameters.

Deep Learning Meets Sparse Regularization: A Signal Processing Perspective

A relatively new mathematical framework is presented that provides the beginning of a deeper understanding of deep learning and precisely characterizes the functional properties of neural networks that are trained to respond to data.

How Erdös and Rényi Win the Lottery

This work is the first to show theoretically and experimentally that random ER source networks contain strong lottery tickets, and proves the existence of weak lottery tickets that require a lower degree of overparametrization than strong lottery tickets.

DCI-ES: An Extended Disentanglement Framework with Connections to Identifiability

This work establishes a formal link between disentanglement and the closely related field of independent component analysis, proposes an extended DCI-ES framework with two new measures of representation quality, explicitness and size, and points out how D and C can be computed for black-box predictors.

Predicting generalization with degrees of freedom in neural networks

This work introduces an empirical complexity measure inspired by the classical notion of generalized degrees of freedom in statistics. The measure can be approximated efficiently, is a function of the entire machine-learning training pipeline, and is demonstrated to correlate with generalization performance in the double-descent regime.

Laziness, Barren Plateau, and Noise in Machine Learning

This work reformulates the quantum barren-plateau statement as a precise statement, injecting new hope into near-term variational quantum algorithms, and draws theoretical connections to classical machine learning.

SensorFormer: Efficient Many-to-Many Sensor Calibration With Learnable Input Subsampling

It is argued that both recent past and near-future measurements help to achieve accurate calibration, although accuracy improvements from future measurements come at the cost of a delay, since the calibration output must wait for those measurements to occur.

Pruning Neural Networks at Initialization: Why are We Missing the Mark?

It is shown that, unlike pruning after training, accuracy is the same or higher when randomly shuffling which weights these methods prune within each layer or sampling new initial values, undermining the claimed justifications for these methods and suggesting broader challenges with the underlying pruning heuristics.

Finite Versus Infinite Neural Networks: an Empirical Study

Improved best practices for using NNGP and NT kernels for prediction are developed, including a novel ensembling technique that achieves state-of-the-art results on CIFAR-10 classification for kernels corresponding to each architecture class the authors consider.

Pruning neural networks without any data by iteratively conserving synaptic flow

The data-agnostic pruning algorithm challenges the existing paradigm that, at initialization, data must be used to quantify which synapses are important, and consistently competes with or outperforms existing state-of-the-art pruning algorithms at initialization over a range of models, datasets, and sparsity constraints.
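For a chain of dense layers, the synaptic-flow saliency described above has a closed form that uses only the absolute values of the weights and an all-ones input, with no data at all. A minimal NumPy sketch under that assumption (the function name is illustrative; the actual SynFlow algorithm applies this score iteratively to a full network):

```python
import numpy as np

def synflow_scores(weights):
    """Data-free synaptic-flow saliency for a linear chain of layers.

    With an all-ones input, the synaptic-flow objective is
    R = 1^T |W_{L-1}| ... |W_0| 1, and each weight's score is
    |dR/dW ⊙ W| = backward_flow * |W| * forward_flow.
    Each W in `weights` has shape (out_features, in_features).
    """
    abs_w = [np.abs(W) for W in weights]
    # Forward flow reaching each layer's inputs (starts as all ones).
    fwd = [np.ones(abs_w[0].shape[1])]
    for W in abs_w[:-1]:
        fwd.append(W @ fwd[-1])
    # Backward flow leaving each layer's outputs (built back to front).
    bwd = [np.ones(abs_w[-1].shape[0])]
    for W in reversed(abs_w[1:]):
        bwd.append(W.T @ bwd[-1])
    bwd.reverse()
    # Per-weight score: outer(backward, forward) scaled by |W|.
    return [np.outer(b, f) * W for b, f, W in zip(bwd, fwd, abs_w)]
```

A useful sanity check is that the scores in every layer sum to the same value R, which is the conservation property the paper exploits.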

Picking Winning Tickets Before Training by Preserving Gradient Flow

This work argues that efficient training requires preserving the gradient flow through the network, and proposes a simple but effective pruning criterion called Gradient Signal Preservation (GraSP), which achieves significantly better performance than the baseline at extreme sparsity levels.
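The GraSP criterion scores each weight by its effect on gradient flow via a Hessian-gradient product, S(θ) = -θ ⊙ (Hg). A minimal sketch on a toy quadratic loss where both quantities are exact (the function name and the quadratic setting are illustrative assumptions; in a real network Hg is computed with a Hessian-vector product, and sign conventions for ranking vary by implementation):

```python
import numpy as np

def grasp_scores(theta, A, b):
    """GraSP-style saliency on a toy quadratic loss L = 0.5 θᵀAθ - bᵀθ,
    for which the gradient is g = Aθ - b and the Hessian is H = A.

    Returns S(θ) = -θ ⊙ (H g); weights are ranked by this score to
    decide which can be pruned with the least harm to gradient flow.
    """
    g = A @ theta - b        # exact gradient of the quadratic loss
    Hg = A @ g               # Hessian-gradient product (exact here)
    return -theta * Hg       # per-weight saliency
```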

Scaling Laws for Neural Language Models

Larger models are significantly more sample-efficient, such that optimally compute-efficient training involves training very large models on a relatively modest amount of data and stopping significantly before convergence.
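The scaling laws in question take a power-law form, roughly L(N) ∝ N^(-α) in the number of parameters N. Assuming a clean power law, the exponent can be recovered by a linear fit in log-log space; a minimal sketch (the function name is illustrative, and real measurements would be noisy):

```python
import numpy as np

def fit_power_law(n, loss):
    """Fit loss ≈ c * n^(-alpha) by linear regression in log-log space.

    n: array of parameter counts; loss: array of measured losses.
    Returns (alpha, c).
    """
    slope, intercept = np.polyfit(np.log(n), np.log(loss), 1)
    return -slope, np.exp(intercept)
```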

Rigging the Lottery: Making All Tickets Winners

This paper introduces a method to train sparse neural networks with a fixed parameter count and a fixed computational cost throughout training, without sacrificing accuracy relative to existing dense-to-sparse training methods.

Fast Sparse ConvNets

This work introduces a family of efficient sparse kernels for several hardware platforms, and shows that sparse versions of the MobileNet v1 and MobileNet v2 architectures substantially outperform strong dense baselines on the efficiency-accuracy curve.

Drawing early-bird tickets: Towards more efficient training of deep networks

This paper discovers for the first time that winning tickets can be identified at a very early training stage, termed early-bird (EB) tickets, via low-cost training schemes with large learning rates, consistent with recently reported observations that the key connectivity patterns of neural networks emerge early.

Reconciling modern machine-learning practice and the classical bias–variance trade-off

This work shows how classical theory and modern practice can be reconciled within a single unified performance curve, proposes a mechanism underlying its emergence, and provides evidence for the existence and ubiquity of double descent across a wide spectrum of models and datasets.