What Kinds of Functions do Deep Neural Networks Learn? Insights from Variational Spline Theory

@article{Parhi2022WhatKO,
  title={What Kinds of Functions do Deep Neural Networks Learn? Insights from Variational Spline Theory},
  author={Rahul Parhi and Robert D. Nowak},
  journal={ArXiv},
  year={2022},
  volume={abs/2105.03361}
}
We develop a variational framework to understand the properties of functions learned by fitting deep neural networks with rectified linear unit activations to data. We propose a new function space, which is reminiscent of classical bounded variation-type spaces, that captures the compositional structure associated with deep neural networks. We derive a representer theorem showing that deep ReLU networks are solutions to regularized data fitting problems over functions from this space. The… 
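As a rough illustration (the notation here is schematic; the paper defines the precise compositional function space and its seminorm), the regularized data-fitting problems in question take the form

$$ \min_{f \in \mathcal{F}} \sum_{i=1}^{N} \ell\big(f(x_i), y_i\big) + \lambda \, \|f\|_{\mathcal{F}}, \qquad \lambda > 0, $$

and the representer theorem asserts that such problems admit solutions realized by finite-width deep ReLU networks.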


Near-Minimax Optimal Estimation With Shallow ReLU Neural Networks
TLDR
Sheds light on the phenomenon that neural networks seem to break the curse of dimensionality, derives a minimax lower bound for the estimation problem for this function space, and shows that the neural network estimators are minimax optimal up to logarithmic factors.
Deep Learning meets Nonparametric Regression: Are Weight-Decayed DNNs Locally Adaptive?
We study the theory of neural networks (NNs) through the lens of classical nonparametric regression problems, with a focus on NNs' ability to adaptively estimate functions with heterogeneous smoothness…
Explicit representations for Banach subspaces of Lizorkin distributions
The Lizorkin space is well-suited for studying various operators; e.g., fractional Laplacians and the Radon transform. In this paper, we show that the space is unfortunately not complemented in…
Qualitative neural network approximation over R and C: Elementary proofs for analytic and polynomial activation
TLDR
This article proves, for both real and complex networks with non-polynomial activation, that the closure of the class of neural networks coincides with the closure of the space of polynomials, and proves approximation theorems for classes of deep and shallow neural networks with analytic activation functions by elementary arguments.
Sparsest Univariate Learning Models Under Lipschitz Constraint
TLDR
This work proposes continuous-domain formulations for one-dimensional regression problems whose global minimizers are continuous piecewise-linear (CPWL) functions, and proposes efficient algorithms that find the sparsest solution of each problem: the CPWL mapping with the least number of linear regions.
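For context (schematic notation, not the paper's exact formulation): a univariate CPWL function with knots $\tau_1 < \dots < \tau_K$ can be written as

$$ f(x) = b_0 + b_1 x + \sum_{k=1}^{K} a_k (x - \tau_k)_+, $$

so its Lipschitz constant is the largest absolute slope over the $K+1$ linear regions, and sparsity can be measured by the number of knots (equivalently, of linear regions).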
The Directional Bias Helps Stochastic Gradient Descent to Generalize in Kernel Regression Models
TLDR
The directional bias property of SGD, which is known in the linear regression setting, is generalized to kernel regression, and it is proved that SGD with a moderate, annealing step size converges along the direction of the eigenvector that corresponds to the largest eigenvalue of the Gram matrix.
Characterization of the Variation Spaces Corresponding to Shallow Neural Networks
We consider the variation space corresponding to a dictionary of functions in $L^2(\Omega)$ and present the basic theory of approximation in these spaces. Specifically, we compare the definition…
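Roughly speaking (illustrative definition; the paper gives the precise one), the variation norm of $f$ with respect to a dictionary $\mathbb{D} \subset L^2(\Omega)$ is the smallest total coefficient mass needed to represent $f$ by dictionary elements,

$$ \|f\|_{\mathcal{K}(\mathbb{D})} = \inf\Big\{ \sum_j |a_j| \;:\; f = \sum_j a_j d_j,\ d_j \in \mathbb{D} \Big\}, $$

with the infimum taken over suitable (possibly infinite or integral) representations.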
Connections between Numerical Algorithms for PDEs and Neural Networks
TLDR
This work investigates numerous structural connections between numerical algorithms for partial differential equations (PDEs) and neural architectures, presents U-net architectures that implement multigrid techniques for learning efficient solutions of PDE models, and motivates uncommon design choices such as trainable nonmonotone activation functions.
Deep Quantile Regression: Mitigating the Curse of Dimensionality Through Composition
TLDR
The results show that the DQR estimator has an oracle property in the sense that it achieves the nonparametric minimax optimal rate determined by the intrinsic dimension of the underlying compositional structure of the conditional quantile function, not the ambient dimension of the predictor.
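For background, conditional quantile estimation is typically based on the check (pinball) loss at level $\tau \in (0,1)$,

$$ \rho_\tau(u) = u\big(\tau - \mathbf{1}\{u < 0\}\big) = \max\{\tau u, (\tau - 1) u\}, $$

applied to the residuals $u = y - f(x)$; the DQR estimator fits a deep network under this loss.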
From boundaries to bumps: When closed (extremal) contours are critical
TLDR
This work relaxes the notion of occluding contour and, more accurately, the rim on the object that projects to it, to define closed extremal curves, which are biologically computable, unify shape inferences from shading and specular materials, and predict new phenomena in bump and dent perception.

References

SHOWING 1-10 OF 61 REFERENCES
Deep Neural Networks With Trainable Activations and Controlled Lipschitz Constant
TLDR
It is proved that there always exists a solution with continuous and piecewise-linear (linear-spline) activations, and that an $\ell_1$ penalty on the parameters of the activations favors the learning of sparse nonlinearities.
Learning Activation Functions in Deep (Spline) Neural Networks
TLDR
An efficient computational solution is presented for training deep neural networks with free-form activation functions, using an equivalent B-spline basis to encode the activation functions and expressing the regularization as an $\ell_1$ penalty.
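A schematic of the kind of parameterization described (the exact basis and penalty are as in the paper): each free-form activation is encoded on a uniform grid of step $\Delta$ as

$$ \sigma(x) = \sum_{k} c_k \, \beta\!\left(\frac{x}{\Delta} - k\right), $$

with $\beta$ a B-spline and the regularization expressed through an $\ell_1$-type penalty involving the expansion coefficients $c_k$.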
A representer theorem for deep neural networks
  • M. Unser
  • J. Mach. Learn. Res.
  • 2019
TLDR
A general representer theorem for deep neural networks is derived that makes a direct connection with splines and sparsity, and it is shown that the optimal network configuration can be achieved with activation functions that are nonuniform linear splines with adaptive knots.
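Schematically, the networks in question take the compositional form

$$ f(x) = \big(\boldsymbol{\sigma}_L \circ \mathbf{W}_L \circ \cdots \circ \boldsymbol{\sigma}_1 \circ \mathbf{W}_1\big)(x), $$

where the $\mathbf{W}_\ell$ are linear (affine) layers and, at the optimum, each component of $\boldsymbol{\sigma}_\ell$ is a nonuniform linear spline with adaptive knots (notation here is illustrative, not the paper's exact formulation).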
A Unifying Representer Theorem for Inverse Problems and Machine Learning
  • M. Unser
  • Found. Comput. Math.
  • 2021
TLDR
A general representer theorem is presented that characterizes the solutions of a remarkably broad class of optimization problems and is used to retrieve a number of known results in the literature, e.g., the celebrated representer theorem of machine learning for RKHS, Tikhonov regularization, representer theorems for sparsity-promoting functionals, and the recovery of spikes.
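In broad strokes (a hedged, schematic statement of the result): such problems admit solutions that are finite linear combinations

$$ f = \sum_{k=1}^{K} a_k \, e_k, $$

where the $e_k$ are extreme points of the unit ball of the regularizer and $K$ is bounded by the number of measurements; specializing the regularizer recovers the RKHS/Tikhonov, sparsity-promoting, and spike-recovery cases listed above.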
Are wider nets better given the same number of parameters?
TLDR
It is shown that for models initialized with a random, static sparsity pattern in the weight tensors, network width is the determining factor for good performance, while the number of weights is secondary, as long as trainability is ensured.
Banach Space Representer Theorems for Neural Networks and Ridge Splines
TLDR
A variational framework is developed to understand the properties of the functions learned by neural networks fit to data, and a representer theorem is derived showing that finite-width, single-hidden-layer neural networks are solutions to inverse problems with total variation-like regularization.
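As a sketch (illustrative notation): the single-hidden-layer solutions take the ridge-spline form

$$ f(x) = \sum_{k=1}^{K} v_k \, \rho\big(\mathbf{w}_k^\top x - b_k\big) + \mathbf{c}^\top x + c_0, $$

with $\rho$ the ReLU and the width $K$ at most the number of data points.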
Mad Max: Affine Spline Insights Into Deep Learning
TLDR
A rigorous bridge between deep networks (DNs) and approximation theory via spline functions and operators is built, and a simple penalty term is proposed that can be added to the cost function of any DN learning algorithm to force the templates to be orthogonal to each other.
Pufferfish: Communication-efficient Models At No Extra Cost
TLDR
PUFFERFISH is a communication- and computation-efficient distributed training framework that incorporates gradient compression into the model training process by training low-rank, pre-factorized deep networks, leading to equally accurate, small-parameter models while avoiding the burden of “winning the lottery”.
Pufferfish: Communication-efficient models at no extra cost
  • Proceedings of Machine Learning and Systems, 3
  • 2021