• Corpus ID: 236957308

Expressive Power and Loss Surfaces of Deep Learning Models

@article{Dube2021ExpressivePA,
  title={Expressive Power and Loss Surfaces of Deep Learning Models},
  author={Simant Dube},
  journal={ArXiv},
  year={2021},
  volume={abs/2108.03579}
}
  • S. Dube
  • Published 8 August 2021
  • Computer Science
  • ArXiv
The goals of this paper are two-fold. The first goal is to serve as an expository tutorial on the workings of deep learning models, emphasizing geometrical intuition about the reasons for the success of deep learning. The second goal is to complement current results on the expressive power of deep learning models and their loss surfaces with novel insights and results. In particular, we describe how deep neural networks carve out manifolds, especially when the multiplication neurons are… 

References

SHOWING 1-10 OF 23 REFERENCES

On the Expressive Power of Deep Neural Networks

We propose a new approach to the problem of neural network expressivity, which seeks to characterize how structural properties of a neural network family affect the functions it is able to compute.
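One structural quantity studied in this line of work is how the arc length of a simple input curve changes as it is propagated through successive layers of a random network. A minimal NumPy sketch of that measurement, assuming a random fully connected ReLU network with He-style initialization and a unit circle as the input curve (all illustrative choices, not the paper's exact experiment):

    import numpy as np

    rng = np.random.default_rng(0)

    def trajectory_length(points):
        """Sum of Euclidean distances between consecutive points on a curve."""
        return np.linalg.norm(np.diff(points, axis=0), axis=1).sum()

    # Input curve: a unit circle embedded in a `width`-dimensional space.
    width, depth, n_points = 64, 8, 1000
    theta = np.linspace(0.0, 2 * np.pi, n_points)
    x = np.zeros((n_points, width))
    x[:, 0], x[:, 1] = np.cos(theta), np.sin(theta)

    print(f"layer  0: length = {trajectory_length(x):8.1f}")
    h = x
    for layer in range(1, depth + 1):
        # Random Gaussian weights with He-style scaling for ReLU layers.
        W = rng.normal(0.0, np.sqrt(2.0 / width), size=(width, width))
        h = np.maximum(h @ W, 0.0)  # ReLU layer
        print(f"layer {layer:2d}: length = {trajectory_length(h):8.1f}")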

Loss Surfaces, Mode Connectivity, and Fast Ensembling of DNNs

It is shown that the optima of the complex loss functions of deep neural networks are in fact connected by simple curves over which training and test accuracy are nearly constant, and a training procedure is introduced to discover these high-accuracy pathways between modes.
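A minimal PyTorch sketch of the curve-finding idea on a toy problem: two copies of a tiny MLP are trained independently, and a quadratic Bezier curve with a single trainable bend point is then fit so that the loss stays low along the path between them. The toy data, the functional MLP, and the optimizer settings are illustrative assumptions.

    import torch

    torch.manual_seed(0)

    # Toy binary classification data: two Gaussian blobs.
    n = 400
    x = torch.cat([torch.randn(n, 2) + 2.0, torch.randn(n, 2) - 2.0])
    y = torch.cat([torch.ones(n), torch.zeros(n)])

    D_IN, H = 2, 16
    N_PARAMS = D_IN * H + H + H  # W1, b1, and a bias-free output head

    def forward(params, inputs):
        """Tiny MLP evaluated at an arbitrary flat parameter vector."""
        W1 = params[: D_IN * H].view(D_IN, H)
        b1 = params[D_IN * H : D_IN * H + H]
        w2 = params[D_IN * H + H :]
        return torch.relu(inputs @ W1 + b1) @ w2

    def loss(params):
        return torch.nn.functional.binary_cross_entropy_with_logits(forward(params, x), y)

    def train(params, steps=500, lr=0.05):
        params = params.clone().requires_grad_(True)
        opt = torch.optim.Adam([params], lr=lr)
        for _ in range(steps):
            opt.zero_grad(); loss(params).backward(); opt.step()
        return params.detach()

    # Two independently trained solutions (different random initializations).
    w1 = train(torch.randn(N_PARAMS) * 0.5)
    w2 = train(torch.randn(N_PARAMS) * 0.5)

    # Quadratic Bezier curve w(t) with a single trainable bend point theta.
    theta = ((w1 + w2) / 2).clone().requires_grad_(True)
    opt = torch.optim.Adam([theta], lr=0.05)
    for _ in range(500):
        t = torch.rand(())  # sample a random point along the curve
        w_t = (1 - t) ** 2 * w1 + 2 * (1 - t) * t * theta + t ** 2 * w2
        opt.zero_grad(); loss(w_t).backward(); opt.step()

    # Compare loss along the straight segment vs. the learned curve.
    for t in [i / 10 for i in range(11)]:
        straight = loss((1 - t) * w1 + t * w2).item()
        curve = loss((1 - t) ** 2 * w1 + 2 * (1 - t) * t * theta + t ** 2 * w2).item()
        print(f"t={t:.1f}  straight={straight:.3f}  curve={curve:.3f}")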

On the Number of Linear Regions of Deep Neural Networks

We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have.
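The number of linear regions along a one-dimensional slice of input space can be estimated by tracking the ReLU activation pattern on a fine grid and counting how often it changes. A hedged NumPy sketch, with randomly initialized weights and an arbitrary shallow-versus-deep split of the same number of hidden units:

    import numpy as np

    rng = np.random.default_rng(0)

    def activation_patterns(widths, xs):
        """Return the ReLU on/off pattern of every hidden unit of a random
        fully connected network at each 1-D input in `xs`."""
        h = xs.reshape(-1, 1)
        patterns = []
        d_in = 1
        for d_out in widths:
            W = rng.normal(size=(d_in, d_out))
            b = rng.normal(size=d_out)
            pre = h @ W + b
            patterns.append(pre > 0)
            h = np.maximum(pre, 0.0)
            d_in = d_out
        return np.concatenate(patterns, axis=1)

    def count_linear_regions(widths, n_grid=200_000):
        """Approximate the number of linear regions on [-10, 10] by counting
        changes of the activation pattern along a fine grid."""
        xs = np.linspace(-10.0, 10.0, n_grid)
        p = activation_patterns(widths, xs)
        changes = np.any(p[1:] != p[:-1], axis=1)
        return int(changes.sum()) + 1

    # Same total number of hidden units, arranged shallow vs. deep.
    print("shallow [24]      :", count_linear_regions([24]))
    print("deep    [8, 8, 8] :", count_linear_regions([8, 8, 8]))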

Normalized Attention Without Probability Cage

This work highlights the limitations of constraining attention weights to the probability simplex and the resulting convex hull of value vectors, and proposes to replace the softmax in self-attention with normalization, yielding a generally applicable architecture that is robust to hyperparameters and data biases.
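A hedged NumPy sketch of the general idea of replacing softmax with a normalization of the raw attention scores; the specific standardization used here (zero mean, unit variance per query row) is an illustrative stand-in, not necessarily the exact normalization proposed in the paper:

    import numpy as np

    rng = np.random.default_rng(0)

    def softmax(z, axis=-1):
        z = z - z.max(axis=axis, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=axis, keepdims=True)

    def self_attention(x, wq, wk, wv, normalize="softmax"):
        """Single-head self-attention; `normalize` selects how the raw
        query-key scores are turned into mixing weights."""
        q, k, v = x @ wq, x @ wk, x @ wv
        scores = q @ k.T / np.sqrt(k.shape[-1])
        if normalize == "softmax":
            weights = softmax(scores)  # each row lives on the probability simplex
        else:
            # Standardize each row instead: weights may be negative and need not
            # sum to one, so outputs can leave the convex hull of the values.
            mu = scores.mean(axis=-1, keepdims=True)
            sd = scores.std(axis=-1, keepdims=True) + 1e-6
            weights = (scores - mu) / sd
        return weights @ v

    seq_len, d_model = 6, 8
    x = rng.normal(size=(seq_len, d_model))
    wq, wk, wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))

    print("softmax output:\n", self_attention(x, wq, wk, wv, "softmax").round(2))
    print("normalized output:\n", self_attention(x, wq, wk, wv, "standardize").round(2))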

Rethinking Attention with Performers

Performers are introduced: Transformer architectures which can estimate regular (softmax) full-rank-attention Transformers with provable accuracy, using only linear space and time complexity and without relying on any priors such as sparsity or low-rankness.
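The core trick is to replace the softmax kernel exp(q·k) with an unbiased positive random-feature estimate φ(q)·φ(k), so attention can be computed via matrix associativity in time and memory linear in the sequence length. A minimal NumPy sketch, using plain Gaussian features rather than the orthogonal features of the full FAVOR+ mechanism, with illustrative sizes:

    import numpy as np

    rng = np.random.default_rng(0)

    def softmax_attention(q, k, v):
        """Exact softmax attention: O(L^2) in the sequence length L."""
        scores = q @ k.T / np.sqrt(q.shape[-1])
        scores -= scores.max(axis=-1, keepdims=True)
        w = np.exp(scores)
        w /= w.sum(axis=-1, keepdims=True)
        return w @ v

    def positive_features(x, omega):
        """Positive random features for the softmax kernel:
        exp(q.k) ~= E[phi(q).phi(k)] with phi(x) = exp(omega x - |x|^2/2)/sqrt(m)."""
        m = omega.shape[0]
        return np.exp(x @ omega.T - 0.5 * (x ** 2).sum(-1, keepdims=True)) / np.sqrt(m)

    def performer_attention(q, k, v, n_features=256):
        d = q.shape[-1]
        omega = rng.normal(size=(n_features, d))
        # Fold the 1/sqrt(d) temperature of softmax attention into the inputs.
        qp = positive_features(q / d ** 0.25, omega)
        kp = positive_features(k / d ** 0.25, omega)
        # Associativity gives linear cost: phi(Q) @ (phi(K)^T V), then normalize.
        num = qp @ (kp.T @ v)
        den = qp @ kp.sum(axis=0)
        return num / den[:, None]

    L, d = 32, 16
    q, k, v = (rng.normal(size=(L, d)) for _ in range(3))
    exact = softmax_attention(q, k, v)
    approx = performer_attention(q, k, v)
    print("max abs error:", np.abs(exact - approx).max())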

Tropical Geometry of Deep Neural Networks

It is deduced that feedforward ReLU neural networks with one hidden layer can be characterized by zonotopes, which serve as building blocks for deeper networks, and it is proved that linear regions of such neural networks correspond to vertices of polytopes associated with tropical rational functions.
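The tropical view can be checked numerically for a one-hidden-layer ReLU network on a 1-D input: splitting the output weights into positive and negative parts writes the network as a difference of two convex piecewise-linear functions, each a maximum of affine maps (a tropical polynomial). A small NumPy sketch, with randomly chosen weights as an illustrative assumption:

    import itertools
    import numpy as np

    rng = np.random.default_rng(0)

    # One-hidden-layer ReLU net f(x) = sum_i c_i * relu(w_i * x + b_i) on 1-D input.
    n_hidden = 5
    w = rng.normal(size=n_hidden)
    b = rng.normal(size=n_hidden)
    c = rng.normal(size=n_hidden)

    def relu_net(x):
        return (c * np.maximum(np.outer(x, w) + b, 0.0)).sum(axis=1)

    # Split the output weights into positive and negative parts: f = F - G, where
    # F and G are convex piecewise-linear functions (tropical polynomials).
    cp, cn = np.maximum(c, 0.0), np.maximum(-c, 0.0)

    def tropical_poly(x, coeff):
        """Evaluate sum_i coeff_i * relu(w_i x + b_i), coeff_i >= 0, as a max of
        2^n affine maps: each unit independently picks 0 or its pre-activation."""
        best = np.full_like(x, -np.inf)
        for subset in itertools.product([0.0, 1.0], repeat=n_hidden):
            s = np.asarray(subset)
            slope = (coeff * s * w).sum()
            intercept = (coeff * s * b).sum()
            best = np.maximum(best, slope * x + intercept)
        return best

    x = np.linspace(-5.0, 5.0, 1001)
    diff = tropical_poly(x, cp) - tropical_poly(x, cn)
    print("max |f(x) - (F(x) - G(x))| =", np.abs(relu_net(x) - diff).max())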

The Loss Surfaces of Multilayer Networks

It is proved that recovering the global minimum becomes harder as the network size increases, and that this is in practice irrelevant, as the global minimum often leads to overfitting.
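One hedged way to probe this picture empirically is to train the same architecture from many random initializations and inspect the spread of the final training losses. A minimal PyTorch sketch on a toy regression task; the data, architecture widths, and optimizer settings are illustrative assumptions, not the paper's experiments:

    import torch

    torch.manual_seed(0)

    # Fixed toy regression data.
    x = torch.linspace(-3.0, 3.0, 256).unsqueeze(1)
    y = torch.sin(2.0 * x) + 0.1 * torch.randn_like(x)

    def train_once(hidden, steps=2000, lr=0.01):
        """Train a small MLP from a fresh random initialization; return final MSE."""
        model = torch.nn.Sequential(
            torch.nn.Linear(1, hidden), torch.nn.ReLU(),
            torch.nn.Linear(hidden, hidden), torch.nn.ReLU(),
            torch.nn.Linear(hidden, 1),
        )
        opt = torch.optim.Adam(model.parameters(), lr=lr)
        for _ in range(steps):
            opt.zero_grad()
            loss = torch.nn.functional.mse_loss(model(x), y)
            loss.backward()
            opt.step()
        return loss.item()

    # Spread of final training losses across random restarts, for two widths.
    for hidden in (4, 64):
        finals = [train_once(hidden) for _ in range(10)]
        print(f"width {hidden:3d}: final losses min={min(finals):.4f} max={max(finals):.4f}")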

Measuring the Intrinsic Dimension of Objective Landscapes

Intrinsic dimension allows some quantitative comparison of problem difficulty across supervised, reinforcement, and other types of learning; it is concluded, for example, that solving the inverted pendulum problem is 100 times easier than classifying digits from MNIST, and that playing Atari Pong from pixels is about as hard as classifying CIFAR-10.
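The measurement works by freezing a random initialization theta_0 and a random projection P, and training only a low-dimensional vector z in the reparametrization theta = theta_0 + P z, increasing the subspace dimension until the task can be solved. A minimal PyTorch sketch on a toy classification problem (the tiny MLP, the data, and the dimensions are illustrative assumptions):

    import torch

    torch.manual_seed(0)

    # Toy two-class data: two Gaussian blobs in 10-D.
    n, d_in = 500, 10
    x = torch.cat([torch.randn(n, d_in) + 1.0, torch.randn(n, d_in) - 1.0])
    y = torch.cat([torch.ones(n), torch.zeros(n)])

    HID = 32
    D = d_in * HID + HID + HID  # total number of native parameters

    def forward(flat, inputs):
        """Small MLP evaluated at a flat parameter vector of length D."""
        W1 = flat[: d_in * HID].view(d_in, HID)
        b1 = flat[d_in * HID : d_in * HID + HID]
        w2 = flat[d_in * HID + HID :]
        return torch.relu(inputs @ W1 + b1) @ w2

    def subspace_loss(theta0, P, z):
        """Loss at theta0 + P z: only the low-dimensional z is trainable."""
        logits = forward(theta0 + P @ z, x)
        return torch.nn.functional.binary_cross_entropy_with_logits(logits, y)

    theta0 = torch.randn(D) * 0.1  # frozen random initialization

    for d_sub in (1, 2, 5, 20):
        P = torch.randn(D, d_sub) / d_sub ** 0.5  # frozen random projection
        z = torch.zeros(d_sub, requires_grad=True)
        opt = torch.optim.Adam([z], lr=0.05)
        for _ in range(1000):
            opt.zero_grad()
            subspace_loss(theta0, P, z).backward()
            opt.step()
        acc = ((forward(theta0 + P @ z, x) > 0) == (y > 0.5)).float().mean().item()
        print(f"subspace dim {d_sub:3d}: accuracy = {acc:.3f}")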

Qualitatively characterizing neural network optimization problems

A simple analysis technique is introduced to look for evidence that state-of-the-art neural networks are overcoming local optima; it finds that, on a straight path from initialization to solution, a variety of state-of-the-art neural networks never encounter any significant obstacles.
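The analysis amounts to evaluating the training loss at convex combinations theta(alpha) = (1 - alpha) * theta_init + alpha * theta_final of the initial and final parameters. A minimal PyTorch sketch on a toy regression problem, with an illustrative small network and optimizer:

    import torch

    torch.manual_seed(0)

    # Toy regression data.
    x = torch.linspace(-2.0, 2.0, 200).unsqueeze(1)
    y = torch.sin(3.0 * x)

    model = torch.nn.Sequential(
        torch.nn.Linear(1, 32), torch.nn.Tanh(), torch.nn.Linear(32, 1)
    )
    loss_fn = torch.nn.MSELoss()

    # Snapshot the parameters at initialization, then train to a solution.
    theta_init = [p.detach().clone() for p in model.parameters()]
    opt = torch.optim.Adam(model.parameters(), lr=0.01)
    for _ in range(2000):
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()
    theta_final = [p.detach().clone() for p in model.parameters()]

    # Evaluate the loss along the straight line theta(a) = (1-a)*init + a*final.
    with torch.no_grad():
        for a in [i / 10 for i in range(11)]:
            for p, p0, p1 in zip(model.parameters(), theta_init, theta_final):
                p.copy_((1 - a) * p0 + a * p1)
            print(f"alpha={a:.1f}  loss={loss_fn(model(x), y).item():.4f}")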

Identifying and attacking the saddle point problem in high-dimensional non-convex optimization

This paper proposes a new approach to second-order optimization, the saddle-free Newton method, which can rapidly escape high-dimensional saddle points, unlike gradient descent and quasi-Newton methods; it applies this algorithm to deep and recurrent neural network training and provides numerical evidence for its superior optimization performance.
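The saddle-free step rescales the gradient by the inverse of |H|, the Hessian with its eigenvalues replaced by their absolute values, so that negative-curvature directions are descended rather than ascended; the paper applies this idea approximately in high dimensions, whereas the sketch below uses the exact 2x2 Hessian of a hand-picked toy function with a saddle at the origin (the test function, step sizes, and damping are illustrative assumptions):

    import numpy as np

    def f(p):
        x, y = p
        return x ** 2 - y ** 2 + 0.25 * y ** 4  # saddle at (0, 0), minima at y = +/- sqrt(2)

    def grad(p):
        x, y = p
        return np.array([2 * x, -2 * y + y ** 3])

    def hess(p):
        x, y = p
        return np.array([[2.0, 0.0], [0.0, -2.0 + 3 * y ** 2]])

    def saddle_free_step(p, lr=0.5, damping=1e-3):
        """Newton-like step that uses |H| (absolute eigenvalues) instead of H,
        so negative-curvature directions are descended rather than ascended."""
        eigval, eigvec = np.linalg.eigh(hess(p))
        h_abs_inv = eigvec @ np.diag(1.0 / (np.abs(eigval) + damping)) @ eigvec.T
        return p - lr * h_abs_inv @ grad(p)

    p_gd = p_sfn = np.array([1e-3, 1e-3])  # start very close to the saddle
    for step in range(30):
        p_gd = p_gd - 0.05 * grad(p_gd)   # plain gradient descent lingers near the saddle
        p_sfn = saddle_free_step(p_sfn)   # saddle-free Newton escapes quickly
    print("gradient descent :", p_gd, " f =", f(p_gd))
    print("saddle-free      :", p_sfn, " f =", f(p_sfn))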