Global Pooling, More than Meets the Eye: Position Information is Encoded Channel-Wise in CNNs

Md. Amirul Islam, Matthew Kowal, Sen Jia, Konstantinos G. Derpanis, Neil D. B. Bruce
2021 IEEE/CVF International Conference on Computer Vision (ICCV)

In this paper, we challenge the common assumption that collapsing the spatial dimensions of a 3D (spatial-channel) tensor in a convolutional neural network (CNN) into a vector via global pooling removes all spatial information. Specifically, we demonstrate that positional information is encoded based on the ordering of the channel dimensions, while semantic information is largely not. Following this demonstration, we show the real-world impact of these findings by applying them to two…
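
The pooling operation the abstract refers to can be sketched as follows. This is a minimal illustration in plain Python, assuming a channel-first tensor stored as nested lists; the function name `global_average_pool` and the toy values are hypothetical and not taken from the paper.

```python
def global_average_pool(tensor):
    """Collapse a C x H x W nested-list tensor to a length-C vector of channel means."""
    pooled = []
    for channel in tensor:
        # Flatten the H x W grid of this channel and average it.
        values = [v for row in channel for v in row]
        pooled.append(sum(values) / len(values))
    return pooled


# Toy 2-channel, 2x2 tensor (hypothetical values).
features = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[0.0, 0.0], [0.0, 8.0]],
]
pooled = global_average_pool(features)
print(pooled)  # [2.5, 2.0]
```

Each channel contributes exactly one scalar to the output, so whatever position information survives pooling has to be carried by which channels respond, which is consistent with the abstract's claim that position is encoded in the ordering of the channel dimensions.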

Flamingo: a Visual Language Model for Few-Shot Learning

It is demonstrated that a single Flamingo model can achieve a new state of the art for few-shot learning, simply by prompting the model with task-specific examples.

Position information attention networks for explosive mobile phone classification

This work contributes the first explosive mobile phone benchmark dataset for security screening, the explosive mobile phones X-ray image dataset, which will be made publicly available, and proposes a sample-oriented coefficient called sample cost with an update rule.

Spatial-Channel Transformer for Scene Recognition

The SC-Transformer is a simple yet effective module that uses a new attention mechanism incorporating the relative importance of the spatial and channel domains for a given scene image, and outperforms the previous state-of-the-art spatial-channel attention mechanism.

Satellite component tracking and segmentation based on position information encoding

This paper proposes a position information encoding strategy to solve the problems of target loss and low light in the space environment during the overturning of satellite components, and improves the generalization ability of the model for image position information by embedding the position information matrix.



Position, Padding and Predictions: A Deeper Look at Position Information in CNNs

This paper shows that a surprising degree of absolute position information is encoded in commonly used CNNs, and that zero padding drives CNNs to encode position information in their internal representations, while a lack of padding precludes position encoding.
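
One way to see how zero padding can leak position is a toy 1-D example. This is a sketch under stated assumptions, not the paper's experiment: the helper `conv1d`, the all-ones signal, and the all-ones kernel are illustrative choices.

```python
def conv1d(signal, kernel, pad):
    """Cross-correlate `signal` with `kernel`, zero-padding `pad` samples on each side."""
    padded = [0.0] * pad + list(signal) + [0.0] * pad
    k = len(kernel)
    return [
        sum(padded[i + j] * kernel[j] for j in range(k))
        for i in range(len(padded) - k + 1)
    ]


signal = [1.0] * 5            # constant input: no position cue in the data itself
kernel = [1.0, 1.0, 1.0]
with_padding = conv1d(signal, kernel, pad=1)
without_padding = conv1d(signal, kernel, pad=0)
print(with_padding)     # [2.0, 3.0, 3.0, 3.0, 2.0]
print(without_padding)  # [3.0, 3.0, 3.0]
```

With padding, the border outputs differ from the interior even though the input is constant, so later layers have a signal that marks where the boundary is. Without padding, every output is identical and no such cue exists.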

How Much Position Information Do Convolutional Neural Networks Encode?

A comprehensive set of experiments shows the validity of the hypothesis that deep CNNs implicitly learn to encode absolute position information, and sheds light on how and where this information is represented while offering clues to where positional information is derived from in deep CNNs.

Gated Feedback Refinement Network for Dense Image Labeling

This paper proposes the Gated Feedback Refinement Network (G-FRNet), an end-to-end deep learning framework for dense labeling tasks that addresses limitations of existing methods, and introduces gate units that control the information passed forward in order to filter out ambiguity.

Positional Encoding as Spatial Inductive Bias in GANs

This work shows that SinGAN's impressive capability in learning internal patch distribution stems, to a large extent, from the implicit positional encoding introduced by zero padding in the generators. It also proposes a new multi-scale training strategy and demonstrates its effectiveness in the state-of-the-art unconditional generator StyleGAN2.

Very Deep Convolutional Networks for Large-Scale Image Recognition

This work investigates the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting using an architecture with very small convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers.

An Intriguing Failing of Convolutional Neural Networks and the CoordConv Solution

Preliminary evidence that swapping convolution for CoordConv can improve models on a diverse set of tasks is shown, which works by giving convolution access to its own input coordinates through the use of extra coordinate channels without sacrificing the computational and parametric efficiency of ordinary convolution.
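
The "extra coordinate channels" idea can be sketched in a few lines. This is a minimal illustration, assuming a channel-first nested-list tensor and coordinates scaled to [-1, 1]; the function name `add_coord_channels` and the scaling helper are hypothetical, and the subsequent convolution is omitted.

```python
def add_coord_channels(tensor):
    """Append row- and column-coordinate channels to a C x H x W nested-list tensor."""
    height, width = len(tensor[0]), len(tensor[0][0])

    def scale(index, size):
        # Map index 0..size-1 to [-1, 1]; a single-element axis maps to 0.
        return 0.0 if size == 1 else 2.0 * index / (size - 1) - 1.0

    rows = [[scale(i, height) for _ in range(width)] for i in range(height)]
    cols = [[scale(j, width) for j in range(width)] for _ in range(height)]
    return list(tensor) + [rows, cols]


image = [[[5.0, 5.0, 5.0], [5.0, 5.0, 5.0]]]  # 1 x 2 x 3, constant values
augmented = add_coord_channels(image)
print(len(augmented))  # 3
print(augmented[1])    # [[-1.0, -1.0, -1.0], [1.0, 1.0, 1.0]]
print(augmented[2])    # [[-1.0, 0.0, 1.0], [-1.0, 0.0, 1.0]]
```

An ordinary convolution applied to the augmented tensor sees the coordinates as input features, at the cost of only two extra input channels.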

Fully convolutional networks for semantic segmentation

The key insight is to build “fully convolutional” networks that take input of arbitrary size and produce correspondingly-sized output with efficient inference and learning.

DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs

This work addresses the task of semantic image segmentation with deep learning, proposes atrous spatial pyramid pooling (ASPP) to robustly segment objects at multiple scales, and improves the localization of object boundaries by combining methods from DCNNs and probabilistic graphical models.

Shape or Texture: Understanding Discriminative Features in CNNs

It is shown when and where a network learns about object shape: the majority of overall shape information is learned in the first few epochs of training, and this information is largely encoded in the last few layers of a CNN.

Deep High-Resolution Representation Learning for Visual Recognition

The superiority of the proposed HRNet in a wide range of applications, including human pose estimation, semantic segmentation, and object detection, is shown, suggesting that the HRNet is a stronger backbone for computer vision problems.