Exact Solutions of a Deep Linear Network
@article{Ziyin2022ExactSO, title={Exact Solutions of a Deep Linear Network}, author={Liu Ziyin and Botao Li and Xiangmin Meng}, journal={ArXiv}, year={2022}, volume={abs/2202.04777} }
This work finds the analytical expression of the global minima of a deep linear network with weight decay and stochastic neurons, a fundamental model for understanding the landscape of neural networks. Our result implies that zero is a special point in deep neural network architecture. We show that weight decay strongly interacts with the model architecture and can create bad minima at zero in a network with more than $1$ hidden layer, qualitatively different from a network with only $1$ hidden…
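The claim about the origin can be probed with a quick numerical check. Below is a minimal numpy sketch (the toy data, layer width, and weight-decay strength are made-up choices, stochastic neurons are omitted, and the random-perturbation test is only a heuristic, not the paper's analytical result): with one hidden layer, small perturbations of the zero network can lower the regularized loss, while with two hidden layers they typically cannot, consistent with zero acting as a bad minimum only in the deeper case.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear-regression data (made up for illustration; not from the paper).
n, d_in, d_out, width = 64, 5, 3, 8
X = rng.normal(size=(n, d_in))
Y = X @ rng.normal(size=(d_in, d_out))   # strong linear signal
lam = 0.1                                # weight-decay strength (arbitrary choice)

def reg_loss(Ws):
    """Mean squared error of the product network plus L2 weight decay."""
    P = np.eye(d_in)
    for W in Ws:
        P = P @ W
    mse = np.mean(np.sum((X @ P - Y) ** 2, axis=1))
    return mse + lam * sum(np.sum(W ** 2) for W in Ws)

def random_net(n_hidden, scale):
    """A deep linear network with n_hidden hidden layers and Gaussian weights."""
    dims = [d_in] + [width] * n_hidden + [d_out]
    return [scale * rng.normal(size=(a, b)) for a, b in zip(dims[:-1], dims[1:])]

# Heuristic check of whether the origin behaves like a local minimum:
# compare the loss at zero with the best loss over small random perturbations.
for n_hidden in (1, 2):
    at_zero = reg_loss([np.zeros_like(W) for W in random_net(n_hidden, 1.0)])
    nearby = min(reg_loss(random_net(n_hidden, 0.03)) for _ in range(200))
    print(f"{n_hidden} hidden layer(s): loss at zero = {at_zero:.4f}, "
          f"best small perturbation = {nearby:.4f}")
```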
7 Citations
The Probabilistic Stability of Stochastic Gradient Descent
- Computer Science
- 2023
Only under the lens of probabilistic stability does SGD exhibit rich and practically relevant phases of learning, such as complete loss of stability, incorrect learning, convergence to low-rank saddles, and correct learning.
SGD with a Constant Large Learning Rate Can Converge to Local Maxima
- Computer Science
- 2022
This work constructs worst-case optimization problems illustrating that, when not in the regimes that the previous works often assume, SGD can exhibit many strange and potentially undesirable behaviors.
SGD with a Constant Large Learning Rate Can Converge to Local Maxima
- Computer Science
- 2021
This work constructs worst-case optimization problems illustrating that, when not in the regimes that the previous works often assume, SGD can exhibit many strange and potentially undesirable behaviors.
Sparsity by Redundancy: Solving L1 with a Simple Reparametrization
- Computer Science, ArXiv
- 2022
The results lead to a simple algorithm, spred, that seamlessly integrates L1 regularization into any modern deep learning framework, and bridges the gap in understanding the inductive bias of the redundant parametrization common in deep learning and conventional statistical learning.
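The reparametrization idea summarized above can be illustrated in a few lines. The sketch below (a minimal numpy example on a made-up orthonormal-design regression problem; the actual spred algorithm and hyperparameters are described in the cited paper) writes each weight as an elementwise product w = u * v, applies plain L2 weight decay to u and v, and compares the result of gradient descent with the closed-form lasso (L1) solution.

```python
import numpy as np

rng = np.random.default_rng(1)

# Orthonormal design, so the lasso (L1) solution has a closed form:
# soft-thresholding of the least-squares coefficients.
n, d, lam = 100, 10, 0.3
X, _ = np.linalg.qr(rng.normal(size=(n, d)))      # columns are orthonormal
w_true = np.zeros(d)
w_true[:3] = [2.0, -1.5, 0.5]
y = X @ w_true + 0.1 * rng.normal(size=n)

# Redundant parametrization w = u * v, trained with ordinary weight decay on u and v.
u = 0.3 * rng.normal(size=d)
v = 0.3 * rng.normal(size=d)
lr = 0.03
for _ in range(30000):
    g = X.T @ (X @ (u * v) - y)    # gradient of 0.5*||Xw - y||^2 w.r.t. w = u * v
    gu = g * v + lam * u           # chain rule + weight decay (penalty lam/2 * ||u||^2)
    gv = g * u + lam * v
    u -= lr * gu
    v -= lr * gv

b = X.T @ y                                          # least-squares coefficients
lasso = np.sign(b) * np.maximum(np.abs(b) - lam, 0)  # closed-form L1 solution
print("u * v :", np.round(u * v, 3))
print("lasso :", np.round(lasso, 3))
```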
What shapes the loss landscape of self-supervised learning?
- Computer Science, ArXiv
- 2022
In this theory, the causes of the dimensional collapse are identified, the effect of normalization and bias is studied, and the interpretability afforded by the analytical theory is leveraged to understand how dimensional collapse can be beneficial and what affects the robustness of SSL against data imbalance.
Exact Phase Transitions in Deep Learning
- Computer Science, ArXiv
- 2022
It is proved that the competition between prediction error and model complexity in the training loss leads to the second-order phase transition for nets with one hidden layer and the first-order phase transition for nets with more than one hidden layer.
Posterior Collapse of a Linear Latent Variable Model
- Computer Science, NeurIPS
- 2022
The existence and cause of a type of posterior collapse that frequently occurs in Bayesian deep learning practice are identified, and the result suggests that posterior collapse may be related to neural collapse and dimensional collapse and could be a subclass of a general problem of learning for deeper architectures.
49 References
Deep Learning without Poor Local Minima
- Computer Science, NIPS
- 2016
This paper proves a conjecture published in 1989 and partially addresses an open problem announced at the Conference on Learning Theory (COLT) 2015, and presents an instance for which it can answer the following question: how difficult is it to directly train a deep model in theory?
Variational autoencoders in the presence of low-dimensional data: landscape and implicit bias
- Computer Science, ICLR
- 2022
It is shown that for linear encoders/decoders the conjecture is true, that is, VAE training does recover a generator with support equal to the ground-truth manifold, and that it does so due to an implicit bias of gradient descent rather than merely the VAE loss itself.
Surprises in High-Dimensional Ridgeless Least Squares Interpolation
- Computer Science, Annals of Statistics
- 2022
This paper recovers, in a precise quantitative way, several phenomena that have been observed in large-scale neural networks and kernel machines, including the "double descent" behavior of the prediction risk, and the potential benefits of overparametrization.
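The double-descent behavior mentioned in this summary is easy to reproduce in a toy simulation. The sketch below (dimensions, signal, and noise level are arbitrary illustrative choices, not the paper's setting) fits minimum-norm ("ridgeless") least squares with a growing number p of features on n = 50 samples; the test error typically spikes near the interpolation threshold p = n and falls again in the overparametrized regime p > n.

```python
import numpy as np

rng = np.random.default_rng(2)

# Linear model y = x^T beta + noise; fit minimum-norm ("ridgeless") least squares
# using only the first p of d features.  All numbers are illustrative.
n, d, sigma = 50, 200, 0.5
beta = rng.normal(size=d) / np.sqrt(d)

def sample(m):
    X = rng.normal(size=(m, d))
    return X, X @ beta + sigma * rng.normal(size=m)

X_tr, y_tr = sample(n)
X_te, y_te = sample(2000)

# Test error typically peaks near the interpolation threshold p = n
# and decreases again in the overparametrized regime p > n.
for p in (2, 10, 25, 45, 50, 55, 100, 200):
    w = np.linalg.pinv(X_tr[:, :p]) @ y_tr     # minimum-norm least-squares fit
    risk = np.mean((X_te[:, :p] @ w - y_te) ** 2)
    print(f"p = {p:3d}   test MSE = {risk:10.3f}")
```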
Deep Linear Networks with Arbitrary Loss: All Local Minima Are Global
- Computer Science, Mathematics, ICML
- 2018
This work provides a short and elementary proof of the fact that all local minima are global minima if the hidden layers are either at least as wide as the input layer, or at least as wide as the output layer.
Depth Creates No Bad Local Minima
- Computer Science, ArXiv
- 2017
It is proved that without nonlinearity, depth alone does not create bad local minima, although it induces a non-convex loss surface.
Stochastic neural networks
- Computer Science, Algorithmica
- 2005
A class of algorithms for finding the global minimum of a continuous-variable function defined on a hypercube, based on both diffusion processes and simulated annealing, is presented, and it is shown that “learning” in these networks can be achieved by a set of three interconnected diffusion machines.
Auto-Encoding Variational Bayes
- Computer Science, ICLR
- 2014
A stochastic variational inference and learning algorithm that scales to large datasets and, under some mild differentiability conditions, even works in the intractable case is introduced.
What shapes the loss landscape of self-supervised learning?
- Computer Science, ArXiv
- 2022
In this theory, the causes of the dimensional collapse are identified, the effect of normalization and bias is studied, and the interpretability afforded by the analytical theory is leveraged to understand how dimensional collapse can be beneficial and what affects the robustness of SSL against data imbalance.
Stochastic Neural Networks with Infinite Width are Deterministic
- Computer Science, ArXiv
- 2022
It is proved that as the width of an optimized stochastic neural network tends to infinity, its predictive variance on the training set decreases to zero, which helps better understand how stochasticity affects the learning of neural networks and potentially design better architectures for practical problems.
Sparsity by Redundancy: Solving L1 with a Simple Reparametrization
- Computer Science, ArXiv
- 2022
The results lead to a simple algorithm, spred, that seamlessly integrates L1 regularization into any modern deep learning framework, and bridges the gap in understanding the inductive bias of the redundant parametrization common in deep learning and conventional statistical learning.