Statistical Guarantees for Regularized Neural Networks

Authors: Mahsa Taheri, Fang Xie, Johannes Lederer
Journal: Neural Networks (the official journal of the International Neural Network Society)
Neural networks have become standard tools in the analysis of data, but they lack comprehensive mathematical theories. For example, there are very few statistical guarantees for learning neural networks from data, especially for classes of estimators that are used in practice or are at least similar to those. In this paper, we develop a general statistical guarantee for estimators that consist of a least-squares term and a regularizer. We then exemplify this guarantee with ℓ1-regularization, showing…
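As a rough illustration of the estimator class the abstract describes (a least-squares term plus a regularizer, here the ℓ1 norm of all weights), the objective can be sketched in plain Python. The function names, the tiny two-layer ReLU network, the unnormalized sum of squares, and the tuning parameter `lam` are all assumptions made for this sketch and are not taken from the paper.

```python
# Hypothetical sketch: relu, predict, l1_norm, objective and lam are
# illustrative names, not the paper's notation or code.

def relu(z):
    return max(0.0, z)

def predict(weights, x):
    """Forward pass of a fully connected ReLU network without biases.

    weights is a list of matrices (lists of rows); the last layer is linear
    and is assumed to have a single output unit.
    """
    h = list(x)
    for depth, W in enumerate(weights):
        h = [sum(w * v for w, v in zip(row, h)) for row in W]
        if depth < len(weights) - 1:
            h = [relu(v) for v in h]
    return h[0]

def l1_norm(weights):
    """Sum of absolute values of all weights in all layers."""
    return sum(abs(w) for W in weights for row in W for w in row)

def objective(weights, xs, ys, lam):
    """Least-squares term plus lam times the l1 norm of the weights."""
    squared_error = sum((y - predict(weights, x)) ** 2 for x, y in zip(xs, ys))
    return squared_error + lam * l1_norm(weights)

# Toy data chosen so that the network fits both points exactly.
weights = [[[1.0, -1.0], [0.5, 0.5]], [[1.0, 2.0]]]
xs = [[1.0, 0.0], [0.0, 1.0]]
ys = [2.0, 1.0]
print(objective(weights, xs, ys, lam=0.1))
```

With a perfect fit, the value printed is driven entirely by the penalty, which shows how `lam` trades data fit against the size of the weights.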
Risk Bounds for Robust Deep Learning
This paper shows that empirical-risk minimization with unbounded, Lipschitz-continuous loss functions, such as the least-absolute deviation loss, Huber loss, Cauchy loss, and Tukey's biweight loss, can provide efficient prediction under minimal assumptions on the data.
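Two of the losses named above can be sketched with their textbook definitions. The threshold `delta`, the function names, and the averaging in `empirical_risk` are assumptions for illustration; the paper's exact parameterization may differ.

```python
# Hypothetical sketch using standard textbook definitions of the losses.

def absolute_loss(residual):
    """Least-absolute deviation loss."""
    return abs(residual)

def huber_loss(residual, delta=1.0):
    """Quadratic near zero, linear in the tails (slope bounded by delta)."""
    a = abs(residual)
    if a <= delta:
        return 0.5 * a * a
    return delta * (a - 0.5 * delta)

def empirical_risk(loss, predictions, targets):
    """Average loss of the residuals y - prediction over the sample."""
    return sum(loss(y - p) for p, y in zip(predictions, targets)) / len(targets)

# A large residual contributes linearly, not quadratically, to the Huber risk.
print(empirical_risk(huber_loss, [0.0, 0.0], [0.5, 3.0]))
print(empirical_risk(absolute_loss, [0.0, 0.0], [0.5, 3.0]))
```

Both losses grow at most linearly in the residual, which is the Lipschitz property the summary refers to.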
Hierarchical Adaptive Lasso: Learning Sparse Neural Networks with Shrinkage via Single Stage Training
A novel penalty called Hierarchical Adaptive Lasso (HALO) is presented, which learns to adaptively sparsify the weights of a given network via trainable parameters without learning a mask.
No Spurious Local Minima: on the Optimization Landscapes of Wide and Deep Neural Networks
These theories substantiate the common belief that increasing network widths not only improves the expressiveness of deep-learning pipelines but also facilitates their optimization, and prove in particular that constrained and unconstrained empirical-risk minimization over such networks has no spurious local minima.
Analytic function approximation by path norm regularized deep networks
An entropy bound is provided for the spaces of path norm regularized neural networks with piecewise linear activation functions, such as the ReLU and the absolute value functions that are analytic on certain regions of C.
Deep neural network approximation of analytic functions
An oracle inequality for the expected error of the considered penalized deep neural network estimators is derived from ε-approximate functions that are analytic on certain regions of C.
Function approximation by deep neural networks with parameters $\{0,\pm \frac{1}{2}, \pm 1, 2\}$
It is shown that $C_\beta$-smooth functions can be approximated by neural networks with parameters in $\{0,\pm \frac{1}{2}, \pm 1, 2\}$, and that nonparametric regression estimation with the constructed networks attains the same convergence rate as with the sparse networks with parameters.
HALO: Learning to Prune Neural Networks with Shrinkage
A novel penalty called Hierarchical Adaptive Lasso (HALO) is presented, which learns to adaptively sparsify the weights of a given network via trainable parameters. It is able to learn highly sparse networks with significant gains in performance over state-of-the-art magnitude pruning methods at the same level of sparsity.
Neural networks with superexpressive activations and integer weights
The range of integer weights required for ε-approximation of Hölder continuous functions is derived, which leads to a convergence rate of order $n^{-2\beta/(2\beta+d)} \log^2 n$ for neural network regression estimation of an unknown β-Hölder continuous function given n samples.
Regularization and Reparameterization Avoid Vanishing Gradients in Sigmoid-Type Networks
This paper revisits the vanishing-gradient problem in the context of sigmoid-type activation and uses mathematical arguments to highlight two different sources of the phenomenon, namely large individual parameters and effects across layers, and to illustrate two simple remedies, namely regularization and rescaling.
Layer Sparsity in Neural Networks
A new notion of sparsity, called layer sparsity, is formulated that concerns the networks' layers and, therefore, aligns particularly well with the current trend toward deep networks.


Approximation and Estimation for High-Dimensional Deep Learning Networks
The heart of the analysis is the development of a sampling strategy that demonstrates the accuracy of a sparse covering of deep ramp networks, and lower bounds show that the identified risk is close to being optimal.
Sparse-Input Neural Networks for High-dimensional Nonparametric Regression and Classification
Neural networks are usually not the tool of choice for nonparametric high-dimensional problems where the number of input features is much larger than the number of observations. Though neural…
L1-regularized Neural Networks are Improperly Learnable in Polynomial Time
A kernel-based method is presented such that, with probability at least 1 - δ, it learns a predictor whose generalization error is at most ε worse than that of the neural network; this implies that any sufficiently sparse neural network is learnable in polynomial time.
On the rate of convergence of fully connected very deep neural network regression estimates
This paper shows that similar results can also be obtained for least squares estimates based on simple fully connected neural networks with ReLU activation functions, using new approximation results concerning deep neural networks.
High-Dimensional Learning under Approximate Sparsity: A Unifying Framework for Nonsmooth Learning and Regularized Neural Networks
High-dimensional statistical learning (HDSL) has been widely applied in data analysis, operations research, and stochastic optimization. Despite the availability of multiple theoretical frameworks,…
Group sparse regularization for deep neural networks
The group Lasso penalty, originally proposed in the linear regression literature, is extended to impose group-level sparsity on the network's connections, where each group is defined as the set of outgoing weights from a unit.
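A minimal sketch of such a penalty for one layer is shown below, with each group being the outgoing weights of one unit, that is, one column of the weight matrix. The layout `W[i][j]` (weight from unit `j` to unit `i` of the next layer), the function name, and the square-root-of-group-size scaling (a common group Lasso convention) are assumptions for illustration and are not confirmed by the summary.

```python
import math

# Hypothetical sketch: group_lasso_penalty and the weight layout are
# illustrative assumptions, not the cited paper's code.

def group_lasso_penalty(W):
    """Sum over units j of sqrt(group size) times the l2 norm of column j.

    W[i][j] is assumed to be the weight from unit j to unit i of the next
    layer, so column j collects all outgoing weights of unit j.
    """
    n_out = len(W)
    n_in = len(W[0])
    total = 0.0
    for j in range(n_in):
        group = [W[i][j] for i in range(n_out)]
        total += math.sqrt(n_out) * math.sqrt(sum(w * w for w in group))
    return total

# The second unit has all outgoing weights equal to zero, so it adds nothing
# to the penalty and could be removed from the network.
W = [[3.0, 0.0], [4.0, 0.0]]
print(group_lasso_penalty(W))
```

Because the ℓ2 norm of a group is not differentiable at zero, minimizing this penalty tends to set whole groups to zero at once, which removes entire units and not just single connections.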
Nonparametric regression using deep neural networks with ReLU activation function
Consider the multivariate nonparametric regression model. It is shown that estimators based on sparsely connected deep neural networks with ReLU activation function and properly chosen network…
Neural Network Learning - Theoretical Foundations
The authors explain the role of scale-sensitive versions of the Vapnik-Chervonenkis dimension in large margin classification and in real prediction, and discuss the computational complexity of neural network learning.
Implicit Regularization in Deep Learning
It is shown that implicit regularization induced by the optimization method plays a key role in the generalization and success of deep learning models, and how different complexity measures can ensure generalization is studied to explain different observed phenomena in deep learning.
Complexity, Statistical Risk, and Metric Entropy of Deep Nets Using Total Path Variation
For any ReLU network there is a representation in which the sum of the absolute values of the weights into each node is exactly $1$, and the input layer variables are multiplied by a value $V$…