Corpus ID: 16174239

Stuck in a What? Adventures in Weight Space

Zachary Chase Lipton
Deep learning researchers commonly suggest that converged models are stuck in local minima. More recently, some researchers have observed that, under reasonable assumptions, the vast majority of critical points are saddle points, not true minima. Both descriptions suggest that the weights converge to a point in weight space, be it a local optimum or merely a critical point. However, it's possible that neither interpretation is accurate. As neural networks are typically over-complete, it's easy to show…
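The over-completeness claim can be illustrated with a small sketch (an assumption for illustration, not code from the paper): in a two-layer ReLU network, scaling a hidden unit's incoming weights by any a > 0 and its outgoing weight by 1/a leaves the function, and hence the loss, unchanged, so a "converged" solution lies on a continuous manifold of equivalent weights rather than at an isolated point.

```python
import numpy as np

def f(x, W1, w2):
    # two-layer ReLU network: x -> ReLU(x @ W1) -> linear readout
    return np.maximum(x @ W1, 0.0) @ w2

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
W1 = rng.normal(size=(3, 4))   # input -> hidden
w2 = rng.normal(size=(4, 1))   # hidden -> output

a = 7.0
W1_scaled = W1.copy()
w2_scaled = w2.copy()
W1_scaled[:, 0] *= a           # rescale hidden unit 0's incoming weights
w2_scaled[0, :] /= a           # compensate on its outgoing weight

# ReLU is positively homogeneous, so the outputs are identical
print(np.allclose(f(X, W1, w2), f(X, W1_scaled, w2_scaled)))  # True
```

Because ReLU satisfies max(a·z, 0) = a·max(z, 0) for a > 0, the rescaling cancels exactly, tracing out a one-parameter family of weight settings with identical loss.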
On the Flatness of Loss Surface for Two-layered ReLU Networks
It is proved that two-layered ReLU networks admit transformations that keep the loss at a critical point invariant, so critical points can lie in flat regions, and it is argued that escaping from such flat regions is vital in training neural networks.
Landscape and training regimes in deep learning
Deep learning algorithms are responsible for a technological revolution in a variety of tasks, including image recognition and Go playing. Yet why they work is not understood. Ultimately,…
Luck Matters: Understanding Training Dynamics of Deep ReLU Networks
Using a teacher-student setting, a novel relationship is discovered between the gradient received by hidden student nodes and the activations of teacher nodes in deep ReLU networks, and it is proved that student nodes whose weights are initialized close to teacher nodes converge to them at a faster rate.
The jamming transition as a paradigm to understand the loss landscape of deep neural networks
It is argued that in fully connected deep networks a phase transition delimits the over- and under-parametrized regimes, in which fitting can or cannot be achieved, and it is observed that the ability of fully connected networks to fit random data is independent of their depth, an independence that appears to also hold for real data.
Visualizing the Loss Landscape of Neural Nets
This paper introduces a simple "filter normalization" method that helps visualize loss-function curvature and make meaningful side-by-side comparisons between loss functions, and it explores how network architecture affects the loss landscape and how training parameters affect the shape of minimizers.
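A minimal sketch of the filter-normalization idea, under assumed shapes and a stand-in loss (neither is from the paper): draw a random direction in weight space, then rescale each of its filters to match the norm of the corresponding filter in the trained weights, so 1-D loss slices are comparable across networks with different weight scales.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 3, 3, 3))   # trained conv weights (assumed shape)
d = rng.normal(size=W.shape)        # random direction in weight space

# per-filter rescaling: each filter of d gets the norm of the matching filter of W
for i in range(W.shape[0]):
    d[i] *= np.linalg.norm(W[i]) / np.linalg.norm(d[i])

norms_match = [np.isclose(np.linalg.norm(d[i]), np.linalg.norm(W[i]))
               for i in range(W.shape[0])]
print(all(norms_match))  # True

def loss(weights):
    # stand-in quadratic loss, purely for illustration
    return float(np.sum(weights ** 2))

# a 1-D slice of the loss along the filter-normalized direction
slice_ = [loss(W + a * d) for a in np.linspace(-1.0, 1.0, 5)]
```

The same normalized-direction trick extends to 2-D slices with two independent random directions, which is how the paper's surface plots are built.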
Scaling description of generalization with number of parameters in deep learning
This work relies on the so-called Neural Tangent Kernel, which connects large neural nets to kernel methods, to show that initialization causes finite-size random fluctuations that affect the generalization error of neural networks.
Perspective: A Phase Diagram for Deep Learning unifying Jamming, Feature Learning and Lazy Training
It is argued that different learning regimes can be organized into a phase diagram, in which a line of critical points sharply delimits an under-parametrized phase from an over-parametrized one, and that learning can operate in two regimes separated by a smooth crossover.
Comparing Dynamics: Deep Neural Networks versus Glassy Systems
The training dynamics of deep neural networks (DNNs) are analyzed numerically using methods developed in the statistical physics of glassy systems, suggesting that the dynamics slow down during training because of an increasingly large number of flat directions.
Classifying the classifier: dissecting the weight space of neural networks
An empirical study of the weights of neural networks, in which each model is interpreted as a point in a high-dimensional space -- the neural weight space -- showing that meta-classifiers can reveal a great deal of information about the training setup and optimization from only a small subset of randomly selected consecutive weights.


Identifying and attacking the saddle point problem in high-dimensional non-convex optimization
This paper proposes a new approach to second-order optimization, the saddle-free Newton method, that can rapidly escape high-dimensional saddle points, unlike gradient descent and quasi-Newton methods; the algorithm is applied to deep and recurrent neural network training, with numerical evidence for its superior optimization performance.
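The core idea can be sketched on the textbook saddle f(x, y) = x² − y², whose only critical point (0, 0) is a saddle (the toy function and step size here are illustrative assumptions, not the paper's experiments). Plain Newton jumps straight to the saddle; rescaling the gradient by the inverse of |H|, the Hessian with absolute eigenvalues, turns the negative-curvature direction into a descent direction instead of an attractor.

```python
import numpy as np

def grad(p):
    x, y = p
    return np.array([2.0 * x, -2.0 * y])   # gradient of x^2 - y^2

H = np.array([[2.0, 0.0], [0.0, -2.0]])    # constant Hessian of f

# |H|: same eigenvectors, absolute eigenvalues
w, V = np.linalg.eigh(H)
H_abs = V @ np.diag(np.abs(w)) @ V.T

p_newton = np.array([1.0, 1.0])
p_sfn = np.array([1.0, 1.0])
for _ in range(20):
    # Newton step: attracted to the critical point, saddle or not
    p_newton = p_newton - np.linalg.solve(H, grad(p_newton))
    # saddle-free Newton step (damped): descends along negative curvature
    p_sfn = p_sfn - 0.5 * np.linalg.solve(H_abs, grad(p_sfn))

print(p_newton)  # converges to the saddle (0, 0)
print(p_sfn)     # x -> 0 while |y| grows: escapes the saddle
```

On this function the Newton iterate lands exactly on the saddle after one step, while the saddle-free iterate shrinks x and expands y geometrically, moving down the negative-curvature direction.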
Beating the Perils of Non-Convexity: Guaranteed Training of Neural Networks using Tensor Methods
This work proposes a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. Based on tensor decomposition, it provably converges to the global optimum under a set of mild non-degeneracy conditions.
Qualitatively characterizing neural network optimization problems
A simple analysis technique is introduced to look for evidence that state-of-the-art neural networks are overcoming local optima, finding that, on a straight path from initialization to solution, a variety of state-of-the-art networks never encounter any significant obstacles.
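The diagnostic itself is easy to reproduce in miniature: evaluate the loss along the straight line θ(a) = (1−a)·θ₀ + a·θ₁ between the initial and trained parameters. The toy network and task below (one hidden tanh layer on 1-D regression) are illustrative assumptions standing in for the paper's full-scale networks.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, (64, 1))
y = np.sin(3 * X)

def loss_and_grads(W1, b1, W2, b2):
    # one hidden tanh layer, mean-squared-error loss, analytic gradients
    h = np.tanh(X @ W1 + b1)
    pred = h @ W2 + b2
    d = 2 * (pred - y) / len(X)            # dL/dpred
    dW2, db2 = h.T @ d, d.sum(0)
    dz = (d @ W2.T) * (1 - h ** 2)         # backprop through tanh
    dW1, db1 = X.T @ dz, dz.sum(0)
    return float(np.mean((pred - y) ** 2)), (dW1, db1, dW2, db2)

theta0 = (rng.normal(0, 0.5, (1, 8)), np.zeros(8),
          rng.normal(0, 0.5, (8, 1)), np.zeros(1))
theta1 = tuple(p.copy() for p in theta0)
for _ in range(2000):                       # plain gradient descent
    _, grads = loss_and_grads(*theta1)
    theta1 = tuple(p - 0.3 * g for p, g in zip(theta1, grads))

# loss along the straight path from initialization to solution
path = [loss_and_grads(*[(1 - a) * p0 + a * p1
                         for p0, p1 in zip(theta0, theta1)])[0]
        for a in np.linspace(0, 1, 11)]
print([round(v, 3) for v in path])
```

If the path shows no bump between the endpoint losses, the run never had to climb over a barrier to reach its solution, which is the paper's qualitative observation.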
The mnist database of handwritten digits
The standard benchmark database of 70,000 labeled 28x28 grayscale images of handwritten digits, widely used for training and evaluating machine learning systems.