Trainability Preserving Neural Structured Pruning

Huan Wang, Yun Fu
Several recent works empirically find that the finetuning learning rate is critical to the final performance in neural network structured pruning. Further research finds that the network trainability broken by pruning accounts for this, calling for an urgent need to recover trainability before finetuning. Existing attempts exploit weight orthogonalization to achieve dynamical isometry for improved trainability. However, they only work for linear MLP networks. How to develop a filter pruning…
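The weight orthogonalization mentioned in the abstract can be illustrated with a generic orthogonality penalty. The sketch below is not the paper's method; it assumes each filter is flattened into one row of a matrix W and measures how far W Wᵀ is from the identity using the squared Frobenius norm. The function names and the example filters are hypothetical.

```python
def gram_matrix(rows):
    """Return W @ W^T for a matrix W given as a list of row vectors."""
    return [[sum(a * b for a, b in zip(r1, r2)) for r2 in rows] for r1 in rows]


def orthogonality_penalty(rows):
    """Squared Frobenius norm of (W W^T - I); zero when the rows are orthonormal."""
    gram = gram_matrix(rows)
    return sum(
        (g - (1.0 if i == j else 0.0)) ** 2
        for i, gram_row in enumerate(gram)
        for j, g in enumerate(gram_row)
    )


# Hypothetical example: each row is one flattened filter (an assumption for illustration).
orthonormal_filters = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
correlated_filters = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]

print(orthogonality_penalty(orthonormal_filters))  # 0.0
print(orthogonality_penalty(correlated_filters))   # 2.0
```

A penalty of this kind is usually added to the training loss with a small coefficient, so that the retained filters are pushed toward mutual orthogonality.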


Deep Residual Learning for Image Recognition

This work presents a residual learning framework to ease the training of networks that are substantially deeper than those used previously, and provides comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth.

Dynamical Isometry: The Missing Ingredient for Neural Network Pruning

Several recent works [40, 24] observed an interesting phenomenon in neural network pruning: a larger finetuning learning rate can improve the final performance significantly. Unfortunately, the…

Importance Estimation for Neural Network Pruning

This work describes a novel method that estimates the contribution of a neuron (filter) to the final loss and iteratively removes those with smaller scores, along with two variations that use first- and second-order Taylor expansions to approximate a filter's contribution.

Exact solutions to the nonlinear dynamics of learning in deep linear neural networks

It is shown that deep linear networks exhibit nonlinear learning phenomena similar to those seen in simulations of nonlinear networks, including long plateaus followed by rapid transitions to lower error solutions, and faster convergence from greedy unsupervised pretraining initial conditions than from random initial conditions.

CHEX: CHannel EXploration for CNN Model Compression

This paper proposes to repeatedly prune and regrow the channels throughout the training process, which reduces the risk of pruning important channels prematurely, and can effectively reduce the FLOPs of diverse CNN architectures on a variety of computer vision tasks.

Neural Pruning via Growing Regularization

This work proposes an L2 regularization variant with rising penalty factors and shows it can bring significant accuracy gains compared with its one-shot counterpart, even when the same weights are removed.

A Gradient Flow Framework For Analyzing Network Pruning

A general gradient-flow-based framework is developed that unifies state-of-the-art importance measures through the norm of model parameters and establishes several results related to pruning models early in training, including that magnitude-based pruning preserves first-order model evolution dynamics and is appropriate for pruning minimally trained models.

Orthogonal Convolutional Neural Networks

The proposed orthogonal convolution requires no additional parameters and little computational overhead and consistently outperforms the kernel orthogonality alternative on a wide range of tasks such as image classification and inpainting under supervised, semi-supervised and unsupervised settings.

A Signal Propagation Perspective for Pruning Neural Networks at Initialization

By noting that connection sensitivity is a form of gradient, this work formally characterizes the initialization conditions that ensure reliable connection sensitivity measurements, which in turn yield effective pruning results. Modifications to the existing pruning-at-initialization method lead to improved results on all tested network models for image classification tasks.

Rethinking the Smaller-Norm-Less-Informative Assumption in Channel Pruning of Convolution Layers

This paper proposes a channel pruning technique for accelerating the computations of deep convolutional neural networks (CNNs) that focuses on directly simplifying the channel-to-channel computation graph of a CNN, without needing to perform a computationally difficult and not-always-useful task.