Corpus ID: 195847882

Are deep ResNets provably better than linear predictors?

@inproceedings{Yun2019AreDR,
  title={Are deep ResNets provably better than linear predictors?},
  author={Chulhee Yun and S. Sra and A. Jadbabaie},
  booktitle={NeurIPS},
  year={2019}
}
Recent results in the literature indicate that a residual network (ResNet) composed of a single residual block outperforms linear predictors, in the sense that all local minima in its optimization landscape are at least as good as the best linear predictor. However, these results are limited to a single residual block (i.e., shallow ResNets), instead of the deep ResNets composed of multiple residual blocks. We take a step towards extending this result to deep ResNets. We start by two motivating…
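To make the comparison concrete, the sketch below (an illustration under assumed names and definitions, not code from the paper) contrasts the two objects being compared: a deep ResNet built by stacking residual blocks of the form x + f(x), and a plain linear predictor used as the baseline.

# Illustrative sketch only (assumed names; not the paper's code): a deep ResNet
# built from residual blocks x -> x + f(x), next to a plain linear predictor.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, dim, hidden=64):
        super().__init__()
        # f(x): a small two-layer nonlinear map; the block outputs x + f(x)
        self.f = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))

    def forward(self, x):
        return x + self.f(x)  # identity skip connection

class DeepResNet(nn.Module):
    def __init__(self, dim, num_blocks=4):
        super().__init__()
        self.blocks = nn.Sequential(*[ResidualBlock(dim) for _ in range(num_blocks)])
        self.head = nn.Linear(dim, 1)  # final linear output layer

    def forward(self, x):
        return self.head(self.blocks(x))

# The baseline the landscape results compare against: the best linear predictor.
linear_predictor = nn.Linear(16, 1)
resnet = DeepResNet(dim=16, num_blocks=4)

x = torch.randn(8, 16)
print(resnet(x).shape, linear_predictor(x).shape)  # both torch.Size([8, 1])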
Why ResNet Works? Residuals Generalize
TLDR: According to the obtained generalization bound, regularization terms should be introduced in practice to keep the norms of the weight matrices from growing too large, which helps ensure good generalization ability and justifies the technique of weight decay.
Is the Skip Connection Provable to Reform the Neural Network Loss Landscape?
TLDR: It is theoretically proved that the skip-connection network inherits the good properties of the two-layer network, and that skip connections can help control the connectedness of the sub-level sets, so that any local minimum worse than the global minimum of some two-layer ReLU network will be very "shallow".
Is Supervised Learning With Adversarial Features Provably Better Than Sole Supervision?
  • Litu Rout
  • Computer Science, Mathematics
  • ArXiv
  • 2019
TLDR: This paper shows that supervised learning without adversarial features suffers from a vanishing gradient issue in the near-optimal region, and analyzes how adversarial learning augmented with a supervised signal mitigates this vanishing gradient issue.
An Interpretable Framework for Drug-Target Interaction with Gated Cross Attention
In silico prediction of drug-target interactions (DTI) is significant for drug discovery because it can largely reduce timelines and costs in the drug development process. Specifically, deep…
Computer Vision Based Two-stage Waste Recognition-Retrieval Algorithm for Waste Classification
TLDR: A novel two-stage Waste Recognition-Retrieval algorithm (W2R) is proposed to classify domestic waste via computer vision and sort it automatically according to the four-category regulation.
A Mean-field Analysis of Deep ResNet and Beyond: Towards Provable Optimization Via Overparameterization From Depth
TLDR: A mean-field analysis of deep residual networks, based on a line of works that interpret the continuum limit of the deep residual network as an ordinary differential equation when the network capacity tends to infinity; a new continuum limit is proposed, which enjoys a good landscape in the sense that every local minimizer is global.
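For readers unfamiliar with the continuum-limit view mentioned in this entry, the toy sketch below (assumed notation and a made-up residual map, not the cited paper's code) shows the underlying idea: a stack of L residual blocks x_{k+1} = x_k + (1/L) f(x_k, k/L) is a forward-Euler discretization of the ODE dx/dt = f(x, t), so growing depth recovers the continuum limit.

# Toy illustration (the residual map f and all names below are assumptions).
# A deep stack of residual updates x_{k+1} = x_k + (1/L) f(x_k, k/L) is the
# forward-Euler scheme for dx/dt = f(x, t) on [0, 1].
import numpy as np

def f(x, t):
    # toy residual map; in a real ResNet this would be a learned nonlinear block
    return np.tanh(x) * np.cos(t)

def resnet_forward(x0, num_blocks):
    x, h = x0, 1.0 / num_blocks
    for k in range(num_blocks):
        x = x + h * f(x, k * h)  # one residual block = one Euler step
    return x

# As the number of blocks grows, the output approaches the ODE solution at t = 1.
x0 = np.array([0.5, -1.0])
for L in (4, 64, 1024):
    print(L, resnet_forward(x0, L))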

References

Showing 1-10 of 32 references
Are ResNets Provably Better than Linear Predictors?
  • O. Shamir
  • Computer Science, Mathematics
  • NeurIPS
  • 2018
TLDR: It is rigorously proved that arbitrarily deep, nonlinear residual units indeed exhibit this behavior, in the sense that the optimization landscape contains no local minima with value above what can be obtained with a linear predictor (namely a 1-layer network).
Identity Matters in Deep Learning
TLDR: This work gives a strikingly simple proof that arbitrarily deep linear residual networks have no spurious local optima, and shows that residual networks with ReLU activations have universal finite-sample expressivity, in the sense that the network can represent any function of its sample provided that the model has more parameters than the sample size.
Local minima in training of neural networks
TLDR: It is demonstrated that in this scenario one can construct counter-examples (datasets or initialization schemes) when the network does become susceptible to bad local minima over the weight space.
Deep Learning without Poor Local Minima
In this paper, we prove a conjecture published in 1989 and also partially address an open problem announced at the Conference on Learning Theory (COLT) 2015. With no unrealistic assumption, we first…
Identity Mappings in Deep Residual Networks
TLDR: The propagation formulations behind the residual building blocks suggest that the forward and backward signals can be directly propagated from one block to any other block, when using identity mappings as the skip connections and after-addition activation (the forward and backward formulas are sketched after this reference list).
Depth with Nonlinearity Creates No Bad Local Minima in ResNets
In this paper, we prove that depth with nonlinearity creates no bad local minima in a type of arbitrarily deep ResNets with arbitrary nonlinear activation functions, in the sense that the values of…
Small nonlinearities in activation functions create bad local minima in neural networks
TLDR: The results indicate that in general "no spurious local minima" is a property limited to deep linear networks, and insights obtained from linear networks may not be robust; a comprehensive characterization of global optimality for deep linear networks is presented, which unifies other results on this topic.
Deep Residual Learning for Image Recognition
TLDR: This work presents a residual learning framework to ease the training of networks that are substantially deeper than those used previously, and provides comprehensive empirical evidence showing that these residual networks are easier to optimize and can gain accuracy from considerably increased depth.
Diverse Neural Network Learns True Target Functions
TLDR: This paper analyzes one-hidden-layer neural networks with ReLU activation, showing that despite the non-convexity, neural networks with diverse units have no spurious local minima, and suggests a novel regularization function to promote unit diversity for potentially better generalization.
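As mentioned in the "Identity Mappings in Deep Residual Networks" entry above, the propagation formulations for identity skip connections can be sketched as follows (assumed notation, following the standard pre-activation formulation; this is an illustrative summary, not a quotation of that paper):

% Forward: with identity skips, any deeper unit x_L equals a shallower unit x_l
% plus a sum of residual terms, so the signal propagates directly between blocks.
x_{l+1} = x_l + \mathcal{F}(x_l, W_l)
  \quad\Longrightarrow\quad
  x_L = x_l + \sum_{i=l}^{L-1} \mathcal{F}(x_i, W_i)

% Backward: the gradient contains a directly propagated term (the "1") plus a
% term through the residual branches, which makes it unlikely to vanish.
\frac{\partial \mathcal{E}}{\partial x_l}
  = \frac{\partial \mathcal{E}}{\partial x_L}
    \left( 1 + \frac{\partial}{\partial x_l} \sum_{i=l}^{L-1} \mathcal{F}(x_i, W_i) \right)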