On the information bottleneck theory of deep learning

@article{Saxe2018OnTI,
  title={On the information bottleneck theory of deep learning},
  author={Andrew M. Saxe and Yamini Bansal and Joel Dapello and Madhu S. Advani and Artemy Kolchinsky and Brendan D. Tracey and David D. Cox},
  journal={Journal of Statistical Mechanics: Theory and Experiment},
  year={2019},
  volume={2019}
}
The practical successes of deep neural networks have not been matched by theoretical progress that satisfyingly explains their behavior. In this work, we study the information bottleneck (IB) theory of deep learning, which makes three specific claims: first, that deep networks undergo two distinct phases consisting of an initial fitting phase and a subsequent compression phase; second, that the compression phase is causally related to the excellent generalization performance of deep networks… 
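Much of the debate in the works below turns on how I(X;T) is actually measured. As a concrete reference point, here is a minimal sketch (illustrative only, not the authors' code; all names are ours) of the standard binning estimator that information-plane analyses rely on: hidden activations are discretized into fixed bins, and mutual information is computed from empirical joint frequencies.

```python
# Minimal sketch of the binning-based MI estimate used in
# information-plane analyses (illustrative, not the authors' code).
import numpy as np

def discrete_mi(x_ids, t_ids):
    """Mutual information (nats) between two arrays of discrete ids."""
    joint = np.zeros((x_ids.max() + 1, t_ids.max() + 1))
    np.add.at(joint, (x_ids, t_ids), 1)        # empirical joint counts
    joint /= joint.sum()
    px = joint.sum(axis=1, keepdims=True)
    pt = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float((joint[nz] * np.log(joint[nz] / (px @ pt)[nz])).sum())

def bin_activations(T, n_bins=30, lo=-1.0, hi=1.0):
    """Collapse each activation vector to a single discrete bin id."""
    edges = np.linspace(lo, hi, n_bins + 1)
    binned = np.digitize(T, edges)             # per-unit bin index
    _, ids = np.unique(binned, axis=0, return_inverse=True)
    return ids

# Toy usage: bounded tanh layer, each input treated as its own x id.
rng = np.random.default_rng(0)
X = rng.standard_normal((512, 10))
T = np.tanh(X @ rng.standard_normal((10, 5)))
print("I(X;T) ~", discrete_mi(np.arange(len(X)), bin_activations(T)))
```

The choice of lo, hi, and n_bins is exactly where the binning-artifact question enters: fixed edges are natural for tanh's bounded range but questionable for unbounded activations, which is the issue several of the papers below address.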

Information Bottleneck: Exact Analysis of (Quantized) Neural Networks

TLDR
This study monitors the dynamics of quantized neural networks, in which the whole deep learning system is discretized so that no approximation is required when computing the MI, and shows that the initial IB results were not artifacts of binning.
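The appeal of the quantized setting is that no estimator is needed at all: when weights and activations live on a finite grid, the hidden representation T is a deterministic discrete function of X, so H(T|X) = 0 and I(X;T) = H(T) exactly. A hedged sketch of that idea (toy quantization, illustrative names):

```python
# Sketch of exact MI in a quantized network: T is a deterministic
# discrete function of X, so I(X;T) = H(T) exactly over the empirical
# input distribution. Illustrative only, not the paper's code.
import numpy as np

def quantize(a, levels=8, lo=-1.0, hi=1.0):
    """Snap values onto a uniform grid of `levels` points in [lo, hi]."""
    grid = np.linspace(lo, hi, levels)
    return grid[np.abs(a[..., None] - grid).argmin(axis=-1)]

def exact_entropy(T):
    """H(T) in nats from exact counts over the finite activation set."""
    _, counts = np.unique(T, axis=0, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log(p)).sum())

rng = np.random.default_rng(1)
X = quantize(rng.uniform(-1, 1, (256, 8)), levels=4)
W = quantize(rng.standard_normal((8, 3)), levels=4)
T = quantize(np.tanh(X @ W), levels=4)   # fully discretized hidden layer
print("I(X;T) = H(T) =", exact_entropy(T), "nats, no binning approximation")
```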

Entropy and mutual information in models of deep neural networks

TLDR
An experimental framework with generative models of synthetic datasets is proposed, on which deep neural networks are trained with a weight constraint designed so that the assumption in (i) is verified during learning; it is concluded that, in the proposed setting, the relationship between compression and generalization remains elusive.

Understanding Learning Dynamics of Binary Neural Networks via Information Bottleneck

TLDR
This paper analyzes Binary Neural Networks (BNNs) through the Information Bottleneck principle and observes that the training dynamics of BNNs differ considerably from those of Deep Neural Networks (DNNs), which exhibit separate empirical risk minimization and representation compression phases.

Estimating Information Flow in Neural Networks

TLDR
An auxiliary (noisy) DNN framework is introduced, and a rigorous estimator for I(X;T) in noisy DNNs is developed, which clarifies the past observations of compression and isolates the geometric clustering of hidden representations as the true phenomenon of interest.
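The identity that makes the noisy framework tractable, in our paraphrase of the setup: each layer's output is injected with isotropic Gaussian noise, T_ℓ = f_ℓ(X) + Z with Z independent of X, so the conditional entropy term reduces to the known noise entropy:

```latex
% With T_\ell = f_\ell(X) + Z, \; Z \sim \mathcal{N}(0, \sigma^2 I_d)
% independent of X, the conditional entropy is just the noise entropy:
I(X; T_\ell) \;=\; h(T_\ell) - h(T_\ell \mid X)
            \;=\; h(T_\ell) - \tfrac{d}{2}\log\!\left(2\pi e \sigma^2\right)
```

This leaves only the marginal differential entropy h(T_ℓ), a Gaussian mixture over the training inputs, to be estimated from samples.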

A Critical Review of Information Bottleneck Theory and its Applications to Deep Learning

TLDR
A comprehensive review of IB theory is provided, covering its information-theoretic roots and the recently proposed applications to understanding deep learning models.

Estimating Information Flow in Deep Neural Networks

TLDR
It is revealed that compression, i.e. a reduction in I(X;T_ℓ) over the course of training, is driven by progressive geometric clustering of the representations of samples from the same class, and new evidence is provided that compression and generalization may not be causally related.

The Gaussian equivalence of generative models for learning with shallow neural networks

TLDR
This work establishes rigorous conditions for the Gaussian equivalence to hold in the case of single-layer generative models, as well as deterministic rates for convergence in distribution, and derives a closed set of equations describing the generalisation performance of two widely studied machine learning problems.

Adaptive Estimators Show Information Compression in Deep Neural Networks

TLDR
More robust mutual information estimation techniques are developed that adapt to the hidden activity of neural networks and produce more sensitive measurements of activations from all functions, especially unbounded ones; these estimators are used to explore compression in networks with a range of different activation functions.
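A hedged sketch of the "adaptive" ingredient described above (illustrative, not the paper's estimator): rather than fixing bin edges in advance, which suits bounded activations like tanh, the edges track the empirically observed range of each layer's activity, so unbounded activations such as ReLU are measured sensitively too.

```python
# Illustrative sketch: bin edges adapted to the observed activation
# range, which matters for unbounded activations (e.g. ReLU). Pair with
# any discrete MI routine, such as the discrete_mi() sketched earlier.
import numpy as np

def adaptive_bin_ids(T, n_bins=30):
    lo, hi = T.min(), T.max()                 # adapt edges to this layer
    edges = np.linspace(lo, hi + 1e-9, n_bins + 1)
    binned = np.digitize(T, edges)
    _, ids = np.unique(binned, axis=0, return_inverse=True)
    return ids

rng = np.random.default_rng(2)
T_relu = np.maximum(0.0, rng.standard_normal((512, 5)) * 3.0)  # unbounded
print("distinct binned states:", adaptive_bin_ids(T_relu).max() + 1)
```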
...

References

Deep learning and the information bottleneck principle

TLDR
It is argued that the optimal architecture (the number of layers and the features/connections at each layer) is related to the bifurcation points of the information bottleneck tradeoff, namely the relevant compression of the input layer with respect to the output layer.
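For reference, the tradeoff the TLDR refers to is the standard information bottleneck objective (notation ours): compress the input X into a representation T while preserving information about the label Y,

```latex
\min_{p(t \mid x)} \; I(X; T) \;-\; \beta \, I(T; Y)
```

where β controls the compression-prediction tradeoff; the "bifurcation points" above are the critical values of β at which the structure of the optimal representation changes.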

Exact solutions to the nonlinear dynamics of learning in deep linear neural networks

TLDR
It is shown that deep linear networks exhibit nonlinear learning phenomena similar to those seen in simulations of nonlinear networks, including long plateaus followed by rapid transitions to lower error solutions, and faster convergence from greedy unsupervised pretraining initial conditions than from random initial conditions.

Opening the Black Box of Deep Neural Networks via Information

TLDR
This work demonstrates the effectiveness of the Information-Plane visualization of DNNs and shows that the training time is dramatically reduced when adding more hidden layers, and the main advantage of the hidden layers is computational.

High-dimensional dynamics of generalization error in neural networks

Optimal Architectures in a Solvable Model of Deep Networks

TLDR
This work provides analytically derived recursion relations describing the propagation of the signals along the deep network and shows that these model networks have optimal depths.

The Loss Surfaces of Multilayer Networks

TLDR
It is proved that recovering the global minimum becomes harder as the network size increases, and that this is in practice irrelevant, as the global minimum often leads to overfitting.

Sharp Minima Can Generalize For Deep Nets

TLDR
It is argued that most notions of flatness are problematic for deep models and cannot be directly applied to explain generalization, and when focusing on deep networks with rectifier units, the particular geometry of parameter space induced by the inherent symmetries that these architectures exhibit is exploited.

On the Emergence of Invariance and Disentangling in Deep Representations

TLDR
It is shown that invariance in a deep neural network is equivalent to minimality of the representation it computes, and can be achieved by stacking layers and injecting noise in the computation, under realistic and empirically validated assumptions.

Statistical mechanics of learning from examples.

TLDR
It is shown that for smooth networks, i.e., those with continuously varying weights and smooth transfer functions, the generalization curve asymptotically obeys an inverse power law, while for nonsmooth networks other behaviors can appear, depending on the nature of the nonlinearities as well as the realizability of the rule.
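In symbols, the smooth-network claim is a power-law decay of the generalization error with the number of training examples P (generic form; the constant and exponent depend on the network and task):

```latex
\epsilon_g(P) \;\sim\; \frac{c}{P^{\nu}} \qquad (P \to \infty)
```

while nonsmooth networks can depart from this form, depending on the nonlinearities and the realizability of the rule.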

On the Number of Linear Regions of Deep Neural Networks

We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep…