• Corpus ID: 224820076

Learning Curves for Analysis of Deep Networks

Derek Hoiem, Tanmay Gupta, Zhizhong Li, Michal Shlapentokh-Rothman
A learning curve models a classifier's test error as a function of the number of training samples. Prior works show that learning curves can be used to select model parameters and extrapolate performance. We investigate how to use learning curves to analyze the impact of design choices, such as pre-training, architecture, and data augmentation. We propose a method to robustly estimate learning curves, abstract their parameters into error and data-reliance, and evaluate the effectiveness of… 
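The abstract describes fitting learning curves and abstracting their parameters into an asymptotic error and a data-reliance term. A minimal sketch of that idea, under the common assumption that test error follows a shifted power law err(n) ≈ a + b·n^(−β), where a plays the role of asymptotic error and b·n^(−β) the data-reliance term (this parameterization is illustrative, not the paper's exact estimator):

```python
def fit_power_law(ns, errs):
    """Fit err(n) ~ a + b * n**(-beta) by grid search over beta,
    with closed-form least squares for (a, b) at each candidate beta."""
    best = None
    for k in range(1, 201):
        beta = k / 100.0                      # beta in (0, 2]
        xs = [n ** (-beta) for n in ns]       # model is linear in (a, b) given beta
        m = len(ns)
        sx, sy = sum(xs), sum(errs)
        sxx = sum(x * x for x in xs)
        sxy = sum(x * y for x, y in zip(xs, errs))
        denom = m * sxx - sx * sx
        if abs(denom) < 1e-12:
            continue
        b = (m * sxy - sx * sy) / denom
        a = (sy - b * sx) / m
        sse = sum((a + b * x - y) ** 2 for x, y in zip(xs, errs))
        if best is None or sse < best[0]:
            best = (sse, a, b, beta)
    return best[1], best[2], best[3]

# Synthetic curve: asymptotic error 0.10, data-reliance term 2 * n**(-0.5)
ns = [100, 200, 400, 800, 1600, 3200]
errs = [0.10 + 2.0 * n ** -0.5 for n in ns]
a, b, beta = fit_power_law(ns, errs)
```

Because the fit is linear in (a, b) once β is fixed, a one-dimensional grid search over β keeps the estimation robust without nonlinear optimization.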


The Shape of Learning Curves: a Review
This review recounts the origins of the term, gives a formal definition of the learning curve, and provides a comprehensive overview of the literature on the shape of learning curves.
Learning Curve Theory
This work develops and theoretically analyses the simplest possible (toy) model that can exhibit n^(−β) learning curves for arbitrary power β > 0, and investigates whether such power laws are universal or depend on the data distribution.
Data Scaling Laws in NMT: The Effect of Noise and Architecture
This work establishes that the test loss of encoder-decoder transformer models scales as a power law in the number of training samples, with a dependence on the model size, and systematically varies aspects of the training setup to understand how they impact the data scaling laws.
Simple Control Baselines for Evaluating Transfer Learning
This work shares an evaluation standard that aims to quantify and communicate transfer learning performance in an informative and accessible setup and encourages using/reporting the suggested control baselines in evaluating transfer learning in order to gain a more meaningful and informative understanding.
Overview of Machine Learning Process Modelling
Results are provided that can be used to assess the performance of novel or existing artificial learners and forecast their ‘capacity to learn’ based on the amount of available or desired data.
Learning Curves: Asymptotic Values and Rate of Convergence
This work proposes a practical and principled predictive method that avoids the costly procedure of training poor classifiers on the whole training set, and it is demonstrated for both single- and multi-layer networks.
Learning Multiple Layers of Features from Tiny Images
It is shown how to train a multi-layer generative model that learns to extract meaningful features which resemble those found in the human visual cortex, using a novel parallelization algorithm to distribute the work among multiple machines connected on a network.
Deep Learning Scaling is Predictable, Empirically
A large scale empirical characterization of generalization error and model size growth as training sets grow is presented and it is shown that model size scales sublinearly with data size.
Revisiting Unreasonable Effectiveness of Data in Deep Learning Era
It is found that performance on vision tasks increases logarithmically with the volume of training data, and it is shown that representation learning (or pre-training) still holds a lot of promise.
Deep Residual Learning for Image Recognition
This work presents a residual learning framework to ease the training of networks that are substantially deeper than those used previously, and provides comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth.
Spectrally-normalized margin bounds for neural networks
This bound is empirically investigated for a standard AlexNet network trained with SGD on the MNIST and CIFAR-10 datasets, with both original and random labels; the bound, the Lipschitz constants, and the excess risks are all in direct correlation, suggesting both that SGD selects predictors whose complexity scales with the difficulty of the learning task, and that the presented bound is sensitive to this complexity.
Aggregated Residual Transformations for Deep Neural Networks
On the ImageNet-1K dataset, it is empirically shown that, even under the restricted condition of maintained complexity, increasing cardinality improves classification accuracy and is more effective than going deeper or wider when capacity is increased.
Gradient Centralization: A New Optimization Technique for Deep Neural Networks
It is shown that GC can regularize both the weight space and output feature space so that it can boost the generalization performance of DNNs, and improves the Lipschitzness of the loss function and its gradient so that the training process becomes more efficient and stable.
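The Gradient Centralization summary above describes centralizing gradients to regularize the weight space. A minimal sketch of the core operation for a fully connected layer's gradient matrix (one row per output unit): subtract each row's mean so the centralized gradient rows sum to zero. In practice GC is applied inside the optimizer step to convolutional and fully connected layers; this standalone function is illustrative only:

```python
def centralize_gradient(grad):
    """Subtract the per-output-unit mean from a 2-D gradient matrix,
    so each row of the centralized gradient sums to (approximately) zero."""
    out = []
    for row in grad:
        mu = sum(row) / len(row)
        out.append([g - mu for g in row])
    return out

# Toy gradient for a layer with 2 output units and 3 inputs
grad = [[0.2, -0.4, 0.8], [1.0, 1.0, 1.0]]
cg = centralize_gradient(grad)
```

Removing the mean component constrains each gradient row to a hyperplane, which is the mechanism the paper links to the regularization and improved Lipschitzness of the loss.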
Very Deep Convolutional Networks for Large-Scale Image Recognition
This work investigates the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting using an architecture with very small convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers.
Stability and Generalization
This work defines notions of stability for learning algorithms and shows how to use them to derive generalization error bounds based on the empirical error and the leave-one-out error.