Corpus ID: 208158013

Information-Theoretic Local Minima Characterization and Regularization

Zhiwei Jia and Hao Su
Recent advances in deep learning theory have evoked the study of generalizability across different local minima of deep neural networks (DNNs). While current work has focused on either discovering properties of good local minima or developing regularization techniques to induce good local minima, no approach exists that can tackle both problems. We achieve these two goals successfully in a unified manner. Specifically, based on the observed Fisher information, we propose a metric both strongly…


AlterSGD: Finding Flat Minima for Continual Learning by Alternative Training
A simple yet effective optimization method, called AlterSGD, that searches for a flat minimum in the loss landscape, which can significantly mitigate forgetting and outperform state-of-the-art methods by a large margin under challenging continual learning protocols.
Semantically Robust Unpaired Image Translation for Data with Unmatched Semantics Statistics
This work proposes to enforce the translated outputs of unpaired image-to-image translation to be semantically invariant w.r.t. small perceptual variations of the inputs, a property it calls "semantic robustness".
Lipschitz Regularized CycleGAN for Improving Semantic Robustness in Unpaired Image-to-image Translation
This paper proposes a novel approach, Lipschitz regularized CycleGAN, for improving semantic robustness and thus alleviating the semantic flipping issue, and adds a gradient penalty loss to the generators, which encourages semantically consistent transformations.
The power of quantum neural networks
This work is the first to demonstrate that well-designed quantum neural networks offer an advantage over classical neural networks through a higher effective dimension and faster training ability, which is verified on real quantum hardware.
Dissecting Non-Vacuous Generalization Bounds based on the Mean-Field Approximation
This work shows empirically that PAC-Bayes bounds optimized using variational inference give negligible gains when modeling the posterior as a Gaussian with diagonal covariance, known as the mean-field approximation.


Densely Connected Convolutional Networks
The Dense Convolutional Network (DenseNet) connects each layer to every other layer in a feed-forward fashion and has several compelling advantages: it alleviates the vanishing-gradient problem, strengthens feature propagation, encourages feature reuse, and substantially reduces the number of parameters.
Universal Statistics of Fisher Information in Deep Neural Networks: Mean Field Approach
Novel statistics of the Fisher information matrix (FIM) are revealed that are universal among a wide class of DNNs, can be connected to a norm-based capacity measure of generalization ability, and can be used to quantitatively estimate an appropriately sized learning rate for gradient methods to converge.
Train longer, generalize better: closing the generalization gap in large batch training of neural networks
This work proposes a "random walk on random landscape" statistical model, which is known to exhibit similar "ultra-slow" diffusion behavior, and presents a novel algorithm named "Ghost Batch Normalization" that enables a significant decrease in the generalization gap without increasing the number of updates.
Entropy-SGD: Biasing Gradient Descent Into Wide Valleys
This paper proposes a new optimization algorithm called Entropy-SGD for training deep neural networks that is motivated by the local geometry of the energy landscape and compares favorably to state-of-the-art techniques in terms of generalization error and training time.
Robust Large Margin Deep Neural Networks
The analysis leads to the conclusion that a bounded spectral norm of the network's Jacobian matrix in the neighbourhood of the training samples is crucial for a deep neural network of arbitrary depth and width to generalize well.
Wide Residual Networks
This paper conducts a detailed experimental study on the architecture of ResNet blocks and proposes a novel architecture in which the depth of residual networks is decreased and their width is increased; the resulting network structures are called wide residual networks (WRNs) and are far superior to their commonly used thin and very deep counterparts.
Deep Residual Learning for Image Recognition
This work presents a residual learning framework to ease the training of networks that are substantially deeper than those used previously, and provides comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth.
Learning Multiple Layers of Features from Tiny Images
It is shown how to train a multi-layer generative model that learns to extract meaningful features which resemble those found in the human visual cortex, using a novel parallelization algorithm to distribute the work among multiple machines connected on a network.
Gradient Descent Finds Global Minima of Deep Neural Networks
The current paper proves gradient descent achieves zero training loss in polynomial time for a deep over-parameterized neural network with residual connections (ResNet) and extends the analysis to deep residual convolutional neural networks and obtains a similar convergence result.
Empirical Analysis of the Hessian of Over-Parametrized Neural Networks
This work makes a case that links the two observations: small and large batch gradient descent appear to converge to different basins of attraction but are in fact connected through their flat region and so belong to the same basin.