• Corpus ID: 226227413

DebiNet: Debiasing Linear Models with Nonlinear Overparameterized Neural Networks

@inproceedings{Xu2021DebiNetDL,
  title={DebiNet: Debiasing Linear Models with Nonlinear Overparameterized Neural Networks},
  author={Shiyun Xu},
  booktitle={AISTATS},
  year={2021}
}
  • Shiyun Xu
  • Published in AISTATS, 1 November 2020
  • Computer Science
Recent years have witnessed strong empirical performance of over-parameterized neural networks on various tasks, along with many theoretical advances, e.g., universal approximation and provable convergence to a global minimum. In this paper, we incorporate over-parameterized neural networks into semi-parametric models to bridge the gap between inference and prediction, especially in the high-dimensional linear problem. By doing so, we can exploit a wide class of networks to approximate the nuisance… 
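As a rough, hedged illustration of this idea (a sketch in the double machine learning spirit, not the authors' DebiNet algorithm; the data-generating process, network widths, and estimator below are illustrative assumptions of mine):

# Illustrative sketch only: debias the coefficient of D in a partially linear
# model Y = D*theta + g(Z) + eps by partialling out the nuisance g with wide
# (over-parameterized) MLPs and cross-fitting, then running OLS on residuals.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
n, p, theta = 2000, 5, 1.5
Z = rng.normal(size=(n, p))
g0 = np.sin(Z[:, 0]) + np.cos(Z[:, 2])        # nonlinear nuisance
m0 = np.cos(Z[:, 2])                          # D depends on Z -> confounding
D = m0 + rng.normal(scale=0.5, size=n)
Y = theta * D + g0 + rng.normal(scale=0.5, size=n)

res_y, res_d = np.zeros(n), np.zeros(n)
for train, test in KFold(n_splits=2, shuffle=True, random_state=0).split(Z):
    # wide one-hidden-layer networks as nuisance estimators (sizes are arbitrary)
    net_y = MLPRegressor(hidden_layer_sizes=(2048,), max_iter=500, random_state=0)
    net_d = MLPRegressor(hidden_layer_sizes=(2048,), max_iter=500, random_state=0)
    net_y.fit(Z[train], Y[train])
    net_d.fit(Z[train], D[train])
    res_y[test] = Y[test] - net_y.predict(Z[test])    # Y residualized on Z
    res_d[test] = D[test] - net_d.predict(Z[test])    # D residualized on Z

theta_hat = res_d @ res_y / (res_d @ res_d)           # residual-on-residual OLS
print(f"true theta = {theta}, debiased estimate = {theta_hat:.3f}")

The point of the residual-on-residual step is that errors in the neural nuisance fits enter the estimate of theta only through products of small first-stage errors, which is what makes inference on theta possible despite the nonparametric first stage.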


References

Showing 1-10 of 59 references
Wide Neural Networks of Any Depth Evolve as Linear Models Under Gradient Descent
TLDR: This work shows that for wide neural networks the learning dynamics simplify considerably and that, in the infinite-width limit, they are governed by a linear model obtained from the first-order Taylor expansion of the network around its initial parameters. (A small numerical sketch of this linearization follows below.)
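To make the linearization concrete, here is a small numerical sketch (a toy set-up of my own, not the paper's experiments): a very wide two-layer tanh network is compared with its first-order Taylor expansion around the initial parameters under a small parameter perturbation.

# Sketch: f(x; theta) vs. its linearization f(x; theta0) + J(x; theta0)(theta - theta0)
import numpy as np

rng = np.random.default_rng(0)
d, width = 10, 50_000                           # very wide hidden layer
x = rng.normal(size=d)

# two-layer network f(x) = a^T tanh(W x) / sqrt(width)  (NTK-style scaling)
W0 = rng.normal(size=(width, d))
a0 = rng.normal(size=width)

def f(W, a):
    return a @ np.tanh(W @ x) / np.sqrt(width)

# analytic gradients (the Jacobian w.r.t. all parameters) at initialization
h = np.tanh(W0 @ x)
grad_a = h / np.sqrt(width)
grad_W = np.outer(a0 * (1.0 - h ** 2), x) / np.sqrt(width)

# a small parameter perturbation, standing in for a short training trajectory
dW = 1e-3 * rng.normal(size=W0.shape)
da = 1e-3 * rng.normal(size=a0.shape)

exact = f(W0 + dW, a0 + da)
linear = f(W0, a0) + np.sum(grad_W * dW) + grad_a @ da
print("exact:", exact, "  linearized:", linear)   # nearly identical here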
Learning and Generalization in Overparameterized Neural Networks, Going Beyond Two Layers
TLDR: It is proved that overparameterized neural networks trained by SGD (stochastic gradient descent) or its variants can learn, in polynomial time and with polynomially many samples, some notable concept classes, including those represented by two- and three-layer networks with fewer parameters and smooth activations.
Gradient Descent Provably Optimizes Over-parameterized Neural Networks
TLDR: Over-parameterization and random initialization jointly restrict every weight vector to be close to its initialization for all iterations, which allows a strong convexity-like property to show that gradient descent converges at a global linear rate to the global optimum. (An illustrative sketch of this near-initialization behavior follows below.)
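An illustrative sketch of that near-initialization ("lazy") behavior, assuming the usual two-layer ReLU set-up with a frozen output layer and 1/sqrt(m) scaling; the width, step size, and data below are arbitrary choices of mine:

# Sketch: gradient descent on a wide two-layer ReLU net fits the data while
# the hidden weights barely move from their random initialization.
import numpy as np

rng = np.random.default_rng(0)
n, d, m, lr, steps = 20, 5, 5_000, 0.05, 1_000
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

W = rng.normal(size=(m, d))                 # trained hidden-layer weights
a = rng.choice([-1.0, 1.0], size=m)         # frozen output weights
W0 = W.copy()

def forward(W):
    return np.maximum(X @ W.T, 0.0) @ a / np.sqrt(m)   # shape (n,)

for _ in range(steps):
    err = forward(W) - y                               # residuals, shape (n,)
    gate = (X @ W.T > 0).astype(float)                 # ReLU activation pattern
    grad = (gate * err[:, None]).T @ X * (a[:, None] / np.sqrt(m))
    W -= lr * grad

print("final squared loss:", 0.5 * np.sum((forward(W) - y) ** 2))
print("relative weight movement:",
      np.linalg.norm(W - W0) / np.linalg.norm(W0))     # small when m is large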
A Convergence Theory for Deep Learning via Over-Parameterization
TLDR: This work proves why stochastic gradient descent can find global minima on the training objective of DNNs in polynomial time, and implies an equivalence between over-parameterized neural networks and the neural tangent kernel (NTK) in the finite (and polynomial) width setting.
Sensitivity and Generalization in Neural Networks: an Empirical Study
TLDR: It is found that trained neural networks are more robust to input perturbations in the vicinity of the training data manifold, as measured by the norm of the input-output Jacobian of the network, and that this measure correlates well with generalization. (A sketch of computing the Jacobian norm follows below.)
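The sensitivity measure in question is the norm of the input-output Jacobian; as a small sketch (an arbitrary two-layer tanh network of my own, not the paper's trained models), it can be computed in closed form:

# Sketch: Euclidean norm of the input-output Jacobian of the scalar-output
# two-layer network f(x) = a^T tanh(W x).
import numpy as np

rng = np.random.default_rng(0)
d, width = 8, 256
W = rng.normal(size=(width, d)) / np.sqrt(d)
a = rng.normal(size=width) / np.sqrt(width)

def jacobian_norm(x):
    h = np.tanh(W @ x)
    jac = W.T @ (a * (1.0 - h ** 2))        # df/dx, shape (d,)
    return np.linalg.norm(jac)

x = rng.normal(size=d)
print("||df/dx|| at a random input:", jacobian_norm(x))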
Breaking the Curse of Dimensionality with Convex Neural Networks
  • F. Bach
  • Computer Science
    J. Mach. Learn. Res.
  • 2017
TLDR: This work considers neural networks with a single hidden layer and non-decreasing homogeneous activation functions such as rectified linear units, and shows that they are adaptive to unknown underlying linear structures, such as the dependence on the projection of the input variables onto a low-dimensional subspace.
Connectionist nonparametric regression: Multilayer feedforward networks can learn arbitrary mappings
  • H. White
  • Computer Science
    Neural Networks
  • 1990
Fine-Grained Analysis of Optimization and Generalization for Overparameterized Two-Layer Neural Networks
TLDR: This paper analyzes training and generalization for a simple two-layer ReLU net with random initialization, and provides the following improvements over recent works: a tighter characterization of training speed, an explanation for why training a neural net with random labels leads to slower training, and a data-dependent complexity measure.
The Effect of Network Width on Stochastic Gradient Descent and Generalization: an Empirical Study
TLDR: It is found that the optimal SGD hyper-parameters are determined by a "normalized noise scale," a function of the batch size, learning rate, and initialization conditions; in the absence of batch normalization, the optimal normalized noise scale is directly proportional to the width.
Double/Debiased Machine Learning for Treatment and Structural Parameters
TLDR: This work revisits the classic semiparametric problem of inference on a low-dimensional parameter θ_0 in the presence of high-dimensional nuisance parameters η_0 and proves that DML delivers point estimators that concentrate in an N^(-1/2)-neighborhood of the true parameter values and are approximately unbiased and normally distributed, allowing construction of valid confidence statements. (The standard partially linear formulation is sketched below.)
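For concreteness, the partially linear case behind the N^(-1/2) statement is usually written with the Neyman-orthogonal "partialling-out" score; a standard textbook formulation (my notation, with W_i denoting the i-th observation (Y_i, D_i, Z_i), not copied from the paper) is:

\[
  Y = D\,\theta_0 + g_0(Z) + \varepsilon, \qquad
  D = m_0(Z) + V, \qquad
  \mathbb{E}[\varepsilon \mid D, Z] = 0, \quad \mathbb{E}[V \mid Z] = 0,
\]
\[
  \psi(W; \theta, \eta) = \bigl(Y - \ell(Z) - \theta\,(D - m(Z))\bigr)\,(D - m(Z)),
  \qquad \eta = (\ell, m), \quad \ell_0(Z) = \mathbb{E}[Y \mid Z].
\]
Solving \(\tfrac{1}{N}\sum_i \psi(W_i; \hat\theta, \hat\eta) = 0\) with cross-fitted nuisance estimates \(\hat\ell, \hat m\) gives
\[
  \hat\theta
  = \frac{\sum_i \bigl(D_i - \hat m(Z_i)\bigr)\bigl(Y_i - \hat\ell(Z_i)\bigr)}
         {\sum_i \bigl(D_i - \hat m(Z_i)\bigr)^2},
  \qquad
  \sqrt{N}\,(\hat\theta - \theta_0) \xrightarrow{\;d\;} \mathcal{N}(0, \sigma^2).
\]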
...