Why Approximate Matrix Square Root Outperforms Accurate SVD in Global Covariance Pooling?

  • Yue Song, N. Sebe, Wei Wang
  • Published 6 May 2021
  • Computer Science
  • 2021 IEEE/CVF International Conference on Computer Vision (ICCV)
Global Covariance Pooling (GCP) aims at exploiting the second-order statistics of convolutional features, and its effectiveness in boosting the classification performance of Convolutional Neural Networks (CNNs) has been demonstrated. Singular Value Decomposition (SVD) is used in GCP to compute the matrix square root. However, the approximate matrix square root calculated via Newton-Schulz iteration [14] outperforms the accurate one computed via SVD [15]. We empirically analyze the reason… 
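The two routes the abstract contrasts can be sketched in a few lines of NumPy. This is an illustrative sketch, not the paper's implementation; `n_iters` and the trace pre-normalization are common choices from the iterative square-root literature:

```python
import numpy as np

def sqrt_svd(A):
    """'Accurate' square root of a symmetric PSD matrix via SVD."""
    U, s, Vt = np.linalg.svd(A)
    return U @ np.diag(np.sqrt(s)) @ Vt

def sqrt_newton_schulz(A, n_iters=20):
    """Approximate square root via the coupled Newton-Schulz iteration.
    A is pre-normalized by its trace so the iteration converges;
    only matrix multiplications are involved (GPU-friendly)."""
    I = np.eye(A.shape[0])
    norm = np.trace(A)
    Y, Z = A / norm, I.copy()
    for _ in range(n_iters):
        T = 0.5 * (3.0 * I - Z @ Y)
        Y, Z = Y @ T, T @ Z  # Y -> (A/norm)^{1/2}, Z -> (A/norm)^{-1/2}
    return Y * np.sqrt(norm)
```

Both functions return a matrix S with S @ S ≈ A; the paper's question is why the approximate route trains better, not which is numerically more accurate.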

Tables from this paper

Fast Differentiable Matrix Square Root and Inverse Square Root

Two more efficient variants to compute the differentiable matrix square root and the inverse square root are proposed and validated in several real-world applications.

On the Eigenvalues of Global Covariance Pooling for Fine-grained Visual Recognition

A network branch dedicated to magnifying the importance of small eigenvalues is proposed that achieves state-of-the-art performance among GCP methods on three fine-grained benchmarks and is also competitive against other FGVC approaches on larger datasets.

Fast Differentiable Matrix Square Root

Two more efficient variants of the differentiable matrix square root are proposed, based on the Matrix Taylor Polynomial (MTP) and Matrix Padé Approximants (MPA), yielding considerable speed-up compared with SVD or the Newton-Schulz iteration.
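The MTP idea, for instance, truncates the binomial series of the square root after normalizing the matrix. A minimal NumPy sketch (illustrative only; `degree` and the Frobenius-norm pre-scaling are assumptions, not the paper's exact scheme):

```python
import numpy as np

def sqrt_taylor(A, degree=50):
    """Matrix Taylor Polynomial sketch:
    A^{1/2} = sqrt(||A||_F) * (I + M)^{1/2}, with M = A/||A||_F - I,
    expanded via the binomial series (I + M)^{1/2} = sum_k binom(1/2, k) M^k.
    Converges for SPD A, since the eigenvalues of M then lie in (-1, 0]."""
    n = A.shape[0]
    norm = np.linalg.norm(A)          # Frobenius norm
    M = A / norm - np.eye(n)
    S, term, coeff = np.eye(n), np.eye(n), 1.0
    for k in range(1, degree + 1):
        coeff *= (0.5 - (k - 1)) / k  # binom(1/2, k), built iteratively
        term = term @ M
        S = S + coeff * term
    return S * np.sqrt(norm)
```

The Padé variant replaces the truncated series with a rational approximant of the same order, which generally converges faster for ill-conditioned inputs.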

Orthogonal SVD Covariance Conditioning and Latent Disentanglement

This paper systematically studies how to improve covariance conditioning by enforcing orthogonality on the Pre-SVD layer, proposing the Nearest Orthogonal Gradient (NOG) and Optimal Learning Rate (OLR).

Batch-efficient EigenDecomposition for Small and Medium Matrices

This paper proposes a QR-based EigenDecomposition method that performs the ED entirely by batched matrix/vector multiplication, which processes all the matrices simultaneously and thus fully utilizes the power of GPUs.
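As a toy illustration of the batching idea only (the paper's actual method implements the whole decomposition, including the QR factorization itself, with batched matrix/vector products), NumPy ≥ 1.22 broadcasts `np.linalg.qr` over a leading batch dimension, so the classic QR algorithm can process a whole batch of symmetric matrices at once:

```python
import numpy as np

def batched_eigvalsh_qr(A, n_iters=200):
    """Toy batched QR algorithm. A has shape (B, n, n), symmetric.
    Repeating A <- R @ Q (a similarity transform Q^T A Q) drives each
    matrix toward diagonal form; the diagonals then hold the eigenvalues.
    Unshifted, so it assumes eigenvalues of distinct magnitude."""
    for _ in range(n_iters):
        Q, R = np.linalg.qr(A)  # broadcasts over the batch (NumPy >= 1.22)
        A = R @ Q
    return np.sort(np.diagonal(A, axis1=-2, axis2=-1), axis=-1)
```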

Improving Covariance Conditioning of the SVD Meta-layer by Orthogonality

This paper systematically studies how to improve covariance conditioning by enforcing orthogonality on the Pre-SVD layer, proposing the Nearest Orthogonal Gradient (NOG) and Optimal Learning Rate (OLR).

Grouping-matrix based Graph Pooling with Adaptive Number of Clusters

This work proposes GMPOOL, a novel differentiable graph pooling architecture that automatically determines the appropriate number of clusters based on the input data and outperforms conventional methods on molecular property prediction tasks.

Convolutional Fine-Grained Classification With Self-Supervised Target Relation Regularization

Inspired by the recent success of mixup-style data augmentation, randomness is introduced into the soft construction of dynamic target relation graphs to further explore the relation diversity of target classes.

A new stable and avoiding inversion iteration for computing matrix square root

The high computational efficiency and accuracy of the proposed method are demonstrated by computing the principal square roots of different matrices, showing its advantages over existing methods.

WISEFUSE: Workload Characterization and DAG Transformation for Serverless Workflows

This work proposes WISEFUSE, an automated approach to generating an optimized execution plan for serverless DAGs under a user-specified latency objective or budget, and implements it experimentally, showing significant improvements in E2E latency and cost.

Backpropagation-Friendly Eigendecomposition

This paper introduces a numerically stable and differentiable approach to leveraging eigenvectors in deep networks, which can handle large matrices without requiring them to be split, and introduces PCA denoising as a new normalization strategy for deep networks.

Towards Faster Training of Global Covariance Pooling Networks by Iterative Matrix Square Root Normalization

This work proposes an iterative matrix square root normalization method for fast end-to-end training of global covariance pooling networks, which is much faster than EIG- or SVD-based methods, since it involves only matrix multiplications, which are well suited to parallel implementation on GPUs.

Introductory Lectures on Convex Optimization - A Basic Course

It was in the middle of the 1980s that the seminal paper by Karmarkar opened a new epoch in nonlinear optimization; it became more and more common for new methods to come with a complexity analysis, which was considered a better justification of their efficiency than computational experiments.

Improved Bilinear Pooling with CNNs

This paper investigates various ways of normalizing second-order statistics of convolutional features to improve their representation power and finds that matrix square-root normalization offers significant improvements and outperforms alternative schemes such as matrix logarithm normalization when combined with element-wise square-root and l2 normalization.
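A minimal NumPy sketch of the normalization pipeline that sentence describes (illustrative only; the shapes and the `eps` stabilizer are assumptions):

```python
import numpy as np

def normalized_second_order(X, eps=1e-12):
    """X: (N, D) convolutional features at N spatial locations.
    Computes the second-order statistic X^T X / N, applies matrix
    square-root normalization, then the element-wise signed square
    root and l2 normalization discussed above."""
    C = X.T @ X / X.shape[0]                        # (D, D), symmetric PSD
    w, V = np.linalg.eigh(C)
    S = (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T  # matrix square root
    y = np.sign(S) * np.sqrt(np.abs(S))             # element-wise signed sqrt
    y = y.ravel()
    return y / (np.linalg.norm(y) + eps)            # l2 normalization
```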

Matrix Backpropagation for Deep Networks with Structured Layers

A sound mathematical apparatus to formally integrate global structured computation into deep computation architectures and demonstrates that deep networks relying on second-order pooling and normalized cuts layers, trained end-to-end using matrix backpropagation, outperform counterparts that do not take advantage of such global layers.

Deep Residual Learning for Image Recognition

This work presents a residual learning framework to ease the training of networks that are substantially deeper than those used previously, and provides comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth.

Bilinear CNN Models for Fine-Grained Visual Recognition

We propose bilinear models, a recognition architecture that consists of two feature extractors whose outputs are multiplied using the outer product at each location of the image and pooled to obtain an image descriptor.
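For two feature maps flattened to (N, D) arrays, the outer-product-and-pool step reduces to a single matrix product (a sketch; the shapes are assumptions):

```python
import numpy as np

def bilinear_pool(fa, fb):
    """Outer product of the two extractors' features at each of the
    N = H*W locations, sum-pooled over locations:
    sum_n outer(fa[n], fb[n]) == fa^T @ fb.
    fa: (N, Da), fb: (N, Db); returns the (Da, Db) bilinear descriptor."""
    return fa.T @ fb
```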

ImageNet: A large-scale hierarchical image database

A new database called “ImageNet” is introduced, a large-scale ontology of images built upon the backbone of the WordNet structure, much larger in scale and diversity and much more accurate than the current image datasets.

An Investigation Into the Stochasticity of Batch Whitening

This paper quantitatively investigates the stochasticity of different whitening transformations and shows that it correlates well with the optimization behaviors during training, and provides a framework for designing and comparing BW algorithms in different scenarios.

What Deep CNNs Benefit From Global Covariance Pooling: An Optimization Perspective

  • Qilong Wang, Li Zhang, Q. Hu
  • Computer Science
  • 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
  • 2020
This paper explores the effect of GCP on deep CNNs in terms of the Lipschitzness of optimization loss and the predictiveness of gradients, and shows that GCP can make the optimization landscape more smooth and the gradients more predictive.