# On the information bottleneck theory of deep learning

@article{Saxe2018OnTI,
  title={On the information bottleneck theory of deep learning},
  author={Andrew M. Saxe and Yamini Bansal and Joel Dapello and Madhu S. Advani and Artemy Kolchinsky and Brendan D. Tracey and David D. Cox},
  journal={Journal of Statistical Mechanics: Theory and Experiment},
  year={2019},
  volume={2019}
}

The practical successes of deep neural networks have not been matched by theoretical progress that satisfyingly explains their behavior. In this work, we study the information bottleneck (IB) theory of deep learning, which makes three specific claims: first, that deep networks undergo two distinct phases consisting of an initial fitting phase and a subsequent compression phase; second, that the compression phase is causally related to the excellent generalization performance of deep networks…
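Much of the debate summarized in the citations below turns on how mutual information is estimated in deterministic networks. A minimal sketch of the binning estimator at issue is given here; the toy network, layer sizes, sample count, and bin count are illustrative assumptions, not the paper's exact setup.

```python
import numpy as np

rng = np.random.default_rng(0)

X = rng.standard_normal((1000, 10))  # 1000 inputs, 10 features
W = rng.standard_normal((10, 5))     # one hidden layer of 5 tanh units
T = np.tanh(X @ W)                   # hidden activations, bounded in (-1, 1)

def binned_entropy(acts, n_bins=30):
    """Entropy (bits) of activation patterns after uniform binning."""
    edges = np.linspace(-1.0, 1.0, n_bins + 1)
    codes = np.digitize(acts, edges)  # bin index per unit, per sample
    # Each distinct row of bin indices is one discrete "state" of the layer.
    _, counts = np.unique(codes, axis=0, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

# For a deterministic network T = f(X), the binned estimate of I(X; T)
# reduces to H(T_binned), since each input maps to exactly one pattern.
mi_estimate = binned_entropy(T)
print(mi_estimate)
```

Note that with many continuous units relative to samples, nearly every activation pattern falls into its own bin, so the estimate saturates near log2 of the sample count; sensitivity of such estimates to the binning scheme is precisely what several of the citing works below examine.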

## 353 Citations

### Information Bottleneck: Exact Analysis of (Quantized) Neural Networks

- Computer Science
- ICLR
- 2022

This study monitors the dynamics of quantized neural networks, in which the whole deep learning system is discretized so that no approximation is required when computing the MI, and shows that the initial IB results were not artifacts of binning.

### Entropy and mutual information in models of deep neural networks

- Computer Science
- NeurIPS
- 2018

An experimental framework based on generative models of synthetic datasets is proposed, on which deep neural networks are trained with a weight constraint designed so that the assumption in (i) is verified during learning; it is concluded that, in this setting, the relationship between compression and generalization remains elusive.

### Understanding Learning Dynamics of Binary Neural Networks via Information Bottleneck

- Computer Science
- ArXiv
- 2020

This paper analyzes binary neural networks (BNNs) through the Information Bottleneck principle and observes that the training dynamics of BNNs differ considerably from those of deep neural networks (DNNs): DNNs exhibit separate empirical risk minimization and representation compression phases, while BNNs do not.

### Estimating Information Flow in Neural Networks

- Computer Science
- ArXiv
- 2018

An auxiliary (noisy) DNN framework is introduced and a rigorous estimator for I(X;T) in noisy DNNs is developed, which clarifies past observations of compression and isolates the geometric clustering of hidden representations as the true phenomenon of interest.

### A Critical Review of Information Bottleneck Theory and its Applications to Deep Learning

- Computer Science
- ArXiv
- 2021

A comprehensive review of IB theory is provided, covering its information-theoretic roots and the recently proposed applications to understanding deep learning models.

### Estimating Information Flow in Deep Neural Networks

- Computer Science
- ICML
- 2019

It is revealed that compression, i.e. the reduction in I(X; T_ℓ) over the course of training, is driven by progressive geometric clustering of the representations of samples from the same class, and new evidence is provided that compression and generalization may not be causally related.

### The Gaussian equivalence of generative models for learning with shallow neural networks

- Computer Science
- MSML
- 2021

This work establishes rigorous conditions for the Gaussian equivalence to hold in the case of single-layer generative models, as well as deterministic rates for convergence in distribution, and derives a closed set of equations describing the generalisation performance of two widely studied machine learning problems.

### Adaptive Estimators Show Information Compression in Deep Neural Networks

- Computer Science
- ICLR
- 2019

More robust mutual information estimation techniques are developed that adapt to the hidden activity of neural networks and produce more sensitive measurements of activations from all functions, especially unbounded ones; these estimators are used to explore compression in networks with a range of different activation functions.

## References

Showing 1-10 of 30 references.

### Deep learning and the information bottleneck principle

- Computer Science
- 2015 IEEE Information Theory Workshop (ITW)
- 2015

It is argued that the optimal architecture, i.e. the number of layers and the features/connections at each layer, is related to the bifurcation points of the information bottleneck tradeoff, namely, relevant compression of the input layer with respect to the output layer.

### Exact solutions to the nonlinear dynamics of learning in deep linear neural networks

- Computer Science
- ICLR
- 2014

It is shown that deep linear networks exhibit nonlinear learning phenomena similar to those seen in simulations of nonlinear networks, including long plateaus followed by rapid transitions to lower error solutions, and faster convergence from greedy unsupervised pretraining initial conditions than from random initial conditions.

### Opening the Black Box of Deep Neural Networks via Information

- Computer Science
- ArXiv
- 2017

This work demonstrates the effectiveness of the Information-Plane visualization of DNNs and shows that training time is dramatically reduced when adding more hidden layers, suggesting that the main advantage of the hidden layers is computational.

### High-dimensional dynamics of generalization error in neural networks

- Computer Science
- Neural Networks
- 2020

### Optimal Architectures in a Solvable Model of Deep Networks

- Computer Science
- NIPS
- 2016

This work provides analytically derived recursion relations describing the propagation of the signals along the deep network and shows that these model networks have optimal depths.

### The Loss Surfaces of Multilayer Networks

- Computer Science
- AISTATS
- 2015

It is proved that recovering the global minimum becomes harder as the network size increases, and that this is in practice irrelevant, as the global minimum often leads to overfitting.

### Sharp Minima Can Generalize For Deep Nets

- Computer Science
- ICML
- 2017

It is argued that most notions of flatness are problematic for deep models and cannot be directly applied to explain generalization; focusing on deep networks with rectifier units, the particular geometry of parameter space induced by the inherent symmetries these architectures exhibit is exploited.

### On the Emergence of Invariance and Disentangling in Deep Representations

- Computer Science
- ArXiv
- 2017

It is shown that invariance in a deep neural network is equivalent to minimality of the representation it computes, and can be achieved by stacking layers and injecting noise in the computation, under realistic and empirically validated assumptions.

### Statistical mechanics of learning from examples.

- Computer Science
- Physical Review A: Atomic, Molecular, and Optical Physics
- 1992

It is shown that for smooth networks, i.e., those with continuously varying weights and smooth transfer functions, the generalization curve asymptotically obeys an inverse power law, while for nonsmooth networks other behaviors can appear, depending on the nature of the nonlinearities as well as the realizability of the rule.

### On the Number of Linear Regions of Deep Neural Networks

- Computer Science
- NIPS
- 2014

We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep…