DebiNet: Debiasing Linear Models with Nonlinear Overparameterized Neural Networks
@inproceedings{Xu2021DebiNetDL,
  title     = {DebiNet: Debiasing Linear Models with Nonlinear Overparameterized Neural Networks},
  author    = {Shiyun Xu},
  booktitle = {AISTATS},
  year      = {2021}
}
Recent years have witnessed strong empirical performance of over-parameterized neural networks on various tasks, along with many advances in the theory, e.g., universal approximation and provable convergence to the global minimum. In this paper, we incorporate over-parameterized neural networks into semi-parametric models to bridge the gap between inference and prediction, especially in the high-dimensional linear problem. By doing so, we can exploit a wide class of networks to approximate the nuisance…
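Since the abstract is truncated, the following is only a minimal sketch of the general idea, not the paper's exact DebiNet procedure. It assumes a partially linear model Y = D·θ + g(Z) + ε and uses the Neyman-orthogonal partialling-out estimator from the double/debiased machine learning reference listed below, with a wide scikit-learn MLP standing in for the over-parameterized nuisance network; the data-generating process, network width, and all other names are illustrative choices.

```python
# Minimal sketch (NOT the paper's exact algorithm): debias the linear coefficient
# theta in the partially linear model Y = D*theta + g(Z) + eps by
# (i) fitting the nuisance regressions E[Y|Z] and E[D|Z] with a wide
# (over-parameterized) neural network, and (ii) regressing residual on residual
# with cross-fitting (a Neyman-orthogonal partialling-out estimator).
import numpy as np
from sklearn.model_selection import KFold
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
n, p, theta_true = 2000, 20, 1.5
Z = rng.normal(size=(n, p))                       # high-dimensional covariates
g = np.sin(Z[:, 0]) + Z[:, 1] ** 2                # nonlinear nuisance function
D = 0.5 * np.cos(Z[:, 0]) + rng.normal(size=n)    # variable of interest, confounded by Z
Y = theta_true * D + g + rng.normal(size=n)

def wide_net():
    # "wide": far more hidden units (and parameters) than samples in a fold
    return MLPRegressor(hidden_layer_sizes=(2048,), alpha=1e-4,
                        max_iter=500, random_state=0)

res_Y, res_D = np.empty(n), np.empty(n)
for train, test in KFold(n_splits=2, shuffle=True, random_state=0).split(Z):
    res_Y[test] = Y[test] - wide_net().fit(Z[train], Y[train]).predict(Z[test])
    res_D[test] = D[test] - wide_net().fit(Z[train], D[train]).predict(Z[test])

theta_hat = (res_D @ res_Y) / (res_D @ res_D)     # debiased point estimate
J = np.mean(res_D ** 2)
psi = (res_Y - theta_hat * res_D) * res_D         # orthogonal score at theta_hat
se = np.sqrt(np.mean(psi ** 2) / J ** 2 / n)      # plug-in standard error
print(f"theta_hat = {theta_hat:.3f} +/- {1.96 * se:.3f}  (truth: {theta_true})")
```

The residual-on-residual regression removes the regularization and approximation bias of the neural-network nuisance fits from the estimate of θ, which is what makes a confidence interval of the form θ̂ ± 1.96·se meaningful.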
References
Showing 1-10 of 59 references
Wide Neural Networks of Any Depth Evolve as Linear Models Under Gradient Descent
- NeurIPS, 2019
This work shows that for wide neural networks the learning dynamics simplify considerably and that, in the infinite width limit, they are governed by a linear model obtained from the first-order Taylor expansion of the network around its initial parameters.
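Written out (a standard, schematic statement of the linearized regime, not anything specific to this entry), the Taylor expansion and the associated neural tangent kernel are

$$
f_{\mathrm{lin}}(x;\theta) = f(x;\theta_0) + \nabla_\theta f(x;\theta_0)^{\top}(\theta-\theta_0),
\qquad
\Theta(x,x') = \nabla_\theta f(x;\theta_0)^{\top}\,\nabla_\theta f(x';\theta_0),
$$

so that, in the infinite-width limit, gradient-descent training of $f$ behaves like kernel regression with the fixed kernel $\Theta$.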
Learning and Generalization in Overparameterized Neural Networks, Going Beyond Two Layers
- NeurIPS, 2019
It is proved that over-parameterized neural networks can learn notable concept classes, including those represented by two- and three-layer networks with fewer parameters and smooth activations, via SGD (stochastic gradient descent) or its variants in polynomial time using polynomially many samples.
Gradient Descent Provably Optimizes Over-parameterized Neural Networks
- ICLR, 2019
Over-parameterization and random initialization jointly restrict every weight vector to stay close to its initialization throughout training, which yields a strong convexity-like property showing that gradient descent converges to the global optimum at a linear rate.
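Schematically (constants and width requirements omitted; notation assumed here: $u(k)$ are the network's outputs on the training set at step $k$, $y$ the labels, $\eta$ the step size, and $\lambda_0$ the least eigenvalue of the limiting Gram matrix), a linear-rate guarantee of this kind takes the form

$$
\|y - u(k)\|_2^2 \;\le\; \Bigl(1 - \tfrac{\eta\lambda_0}{2}\Bigr)^{k}\,\|y - u(0)\|_2^2 .
$$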
A Convergence Theory for Deep Learning via Over-Parameterization
- ICML, 2019
This work proves why stochastic gradient descent can find global minima of the training objective of DNNs in polynomial time, and implies an equivalence between over-parameterized neural networks and the neural tangent kernel (NTK) in the finite (and polynomial) width setting.
Sensitivity and Generalization in Neural Networks: an Empirical Study
- ICLR, 2018
It is found that trained neural networks are more robust to input perturbations in the vicinity of the training data manifold, as measured by the norm of the input-output Jacobian of the network, and that this measure correlates well with generalization.
Breaking the Curse of Dimensionality with Convex Neural Networks
- J. Mach. Learn. Res., 2017
This work considers neural networks with a single hidden layer and non-decreasing homogeneous activation functions, such as rectified linear units, and shows that they are adaptive to unknown underlying linear structures, such as dependence on the projection of the input variables onto a low-dimensional subspace.
Connectionist nonparametric regression: Multilayer feedforward networks can learn arbitrary mappings
- Neural Networks, 1990
Fine-Grained Analysis of Optimization and Generalization for Overparameterized Two-Layer Neural Networks
- ICML, 2019
This paper analyzes training and generalization for a simple two-layer ReLU network with random initialization and provides the following improvements over recent works: a tighter characterization of training speed, an explanation for why training a neural network with random labels leads to slower training, and a data-dependent complexity measure.
The Effect of Network Width on Stochastic Gradient Descent and Generalization: an Empirical Study
- ICML, 2019
It is found that the optimal SGD hyper-parameters are determined by a "normalized noise scale," which is a function of the batch size, learning rate, and initialization conditions, and in the absence of batch normalization, the optimal normalized noise scale is directly proportional to width.
Double/Debiased Machine Learning for Treatment and Structural Parameters
- 2017
This work revisits the classic semiparametric problem of inference on a low-dimensional parameter θ_0 in the presence of high-dimensional nuisance parameters η_0, and proves that double/debiased machine learning (DML) delivers point estimators that concentrate in an N^{-1/2}-neighborhood of the true parameter values and are approximately unbiased and normally distributed, which allows the construction of valid confidence statements.
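For the partially linear model $Y = D\theta_0 + g_0(Z) + \varepsilon$, one Neyman-orthogonal score of the kind used by DML is the partialling-out score, written schematically as

$$
\psi(W;\theta,\eta) = \bigl(Y - \ell(Z) - \theta\,(D - m(Z))\bigr)\,\bigl(D - m(Z)\bigr),
\qquad \ell(Z)\approx \mathbb{E}[Y\mid Z],\;\; m(Z)\approx \mathbb{E}[D\mid Z],
$$

with $\hat\theta$ chosen so that the cross-fitted sample average of $\psi$ is zero. This is the estimator sketched in code after the abstract above, where the nuisance estimates $\ell$ and $m$ are supplied by over-parameterized networks; the exact form used in DebiNet may differ.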