Differentiable Patch Selection for Image Recognition

Jean-Baptiste Cordonnier, Aravindh Mahendran, Alexey Dosovitskiy, Dirk Weissenborn, Jakob Uszkoreit, Thomas Unterthiner. CVPR 2021.
Neural networks require large amounts of memory and compute to process high-resolution images, even when only a small part of the image is actually informative for the task at hand. We propose a method based on a differentiable Top-K operator that selects the most relevant parts of the input in order to process high-resolution images efficiently. Our method can be interfaced with any downstream neural network, aggregates information from different patches in a flexible way, and allows the whole model to be trained end-to-end.
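A differentiable Top-K selection of the kind described above can be smoothed by perturbing the scores with noise and averaging hard Top-K masks, in the spirit of perturbed optimizers. The sketch below shows only the forward smoothing in numpy; the paper's exact estimator and its backward pass may differ, and `perturbed_topk` is a hypothetical helper name.

```python
import numpy as np

def perturbed_topk(scores, k, sigma=0.05, n_samples=500, seed=0):
    """Monte-Carlo smoothing of the hard Top-K indicator.

    Averages hard top-k one-hot masks over Gaussian perturbations of
    the scores, yielding a soft selection mask whose entries lie in
    [0, 1] and sum exactly to k.
    """
    rng = np.random.default_rng(seed)
    mask = np.zeros(scores.shape[0])
    for _ in range(n_samples):
        noisy = scores + sigma * rng.standard_normal(scores.shape[0])
        topk = np.argpartition(noisy, -k)[-k:]  # indices of the k largest
        mask[topk] += 1.0
    return mask / n_samples

scores = np.array([2.0, 0.1, 1.5, -0.3, 0.2])
soft = perturbed_topk(scores, k=2)  # near-one weight on indices 0 and 2
```

As the noise scale `sigma` grows, the mask becomes softer and gradients flow to more entries; as it shrinks, the mask approaches the hard Top-K indicator.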

Efficient Classification of Very Large Images with Tiny Objects

This work presents an end-to-end CNN model, termed Zoom-In network, that leverages hierarchical attention sampling to classify large images with tiny objects on a single GPU, achieving higher accuracy than existing methods while requiring less memory.

Localizing Semantic Patches for Accelerating Image Classification

This paper proposes an efficient image classification pipeline that first pinpoints task-aware regions over the input image with a lightweight patch proposal network called AnchorNet, and then feeds these localized semantic patches, which have much smaller spatial redundancy, into a general classification network.

MIST: Multiple Instance Spatial Transformer

This work proposes a deep network that can be trained to tackle image reconstruction and classification problems that involve detection of multiple object instances, without any supervision regarding their whereabouts, and is able to learn to detect recurring structures in the training dataset by learning to reconstruct images.

Differentiable Zooming for Multiple Instance Learning on Whole-Slide Images

ZoomMIL is proposed, a method that learns to perform multi-level zooming in an end-to-end manner. It outperforms state-of-the-art MIL methods in WSI classification on two large datasets, while reducing computational demands in terms of Floating-Point Operations (FLOPs) and processing time by up to 40×.

Recurrent Attention Models with Object-centric Capsule Representation for Multi-object Recognition

This work shows that using capsule networks to create an object-centric hidden representation in an encoder-decoder model with iterative glimpse attention yields effective integration of attention and recognition.

Detection of objects in the images: from likelihood relationships towards scalable and efficient neural networks

An attempt is made to systematically analyze trends in the development of approaches and detection methods, reasons behind these developments, as well as metrics designed to assess the quality and reliability of object detection.

A brain-inspired object-based attention network for multi-object recognition and visual reasoning

An encoder-decoder model inspired by the interacting bottom-up and top-down visual pathways that make up the recognition-attention system in the brain achieves near-perfect accuracy and significantly outperforms larger models in generalizing to unseen stimuli.

ScoreNet: Learning Non-Uniform Attention and Augmentation for Transformer-Based Histopathological Image Classification

The proposed ScoreNet is a new efficient transformer that exploits a differentiable recommendation stage to extract discriminative image regions and dedicate computational resources accordingly, and introduces a novel mixing data-augmentation, namely ScoreMix, which mitigates the pitfalls of previous augmentations.

Efficient Video Transformers with Spatial-Temporal Token Selection

STTS is a token selection framework that dynamically selects a few informative tokens in both the temporal and spatial dimensions, conditioned on the input video sample, achieving similar results while requiring 20% less computation.

Knowledge Mining with Scene Text for Fine-Grained Recognition

An end-to-end trainable network is proposed that mines the implicit contextual knowledge behind scene text in an image and enhances semantics and correlation to fine-tune the image representation.

Recurrent Models of Visual Attention

A novel recurrent neural network model that is capable of extracting information from an image or video by adaptively selecting a sequence of regions or locations and only processing the selected regions at high resolution is presented.

Multiple Object Recognition with Visual Attention

The model is a deep recurrent neural network trained with reinforcement learning to attend to the most relevant regions of the input image, and it is shown that the model learns to both localize and recognize multiple objects despite being given only class labels during training.

Spatial Transformer Networks

This work introduces a new learnable module, the Spatial Transformer, which explicitly allows the spatial manipulation of data within the network, and can be inserted into existing convolutional architectures, giving neural networks the ability to actively spatially transform feature maps.
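The core of a Spatial Transformer is an affine sampling grid followed by differentiable bilinear sampling. A minimal single-channel numpy sketch is below; the function names `affine_grid` and `bilinear_sample` are illustrative, not the paper's API, and batching, channels, and the localization network that predicts `theta` are omitted.

```python
import numpy as np

def affine_grid(theta, H, W):
    """Build an (H, W, 2) sampling grid in normalized coords [-1, 1]
    from a 2x3 affine matrix theta, as in a Spatial Transformer."""
    ys, xs = np.meshgrid(np.linspace(-1, 1, H), np.linspace(-1, 1, W),
                         indexing="ij")
    coords = np.stack([xs, ys, np.ones_like(xs)], axis=-1)  # (H, W, 3)
    return coords @ theta.T  # (H, W, 2): source (x, y) per target pixel

def bilinear_sample(img, grid):
    """Bilinearly sample a 2-D image at the normalized grid locations."""
    H, W = img.shape
    x = (grid[..., 0] + 1) * (W - 1) / 2  # back to pixel coordinates
    y = (grid[..., 1] + 1) * (H - 1) / 2
    x0 = np.clip(np.floor(x).astype(int), 0, W - 2)
    y0 = np.clip(np.floor(y).astype(int), 0, H - 2)
    dx, dy = x - x0, y - y0
    v00, v01 = img[y0, x0], img[y0, x0 + 1]
    v10, v11 = img[y0 + 1, x0], img[y0 + 1, x0 + 1]
    return (v00 * (1 - dx) * (1 - dy) + v01 * dx * (1 - dy)
            + v10 * (1 - dx) * dy + v11 * dx * dy)

img = np.arange(16, dtype=float).reshape(4, 4)
identity = np.array([[1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0]])
out = bilinear_sample(img, affine_grid(identity, 4, 4))  # reproduces img
```

Because bilinear sampling is piecewise differentiable in both the image values and the grid coordinates, gradients can flow back into the network that predicts `theta`.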

Processing Megapixel Images with Deep Attention-Sampling Models

A fully differentiable, end-to-end trainable model is presented that samples and processes only a fraction of the full-resolution input image; evaluated on three classification tasks, it reduces computation and memory footprint by an order of magnitude at the same accuracy as classical architectures.
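The attention-sampling idea above rests on a Monte-Carlo estimate: the attention-weighted feature average over all patches is approximated by averaging the features of a few patches sampled in proportion to the attention. A hypothetical numpy sketch (the paper additionally uses sampling without replacement and variance-reduction terms):

```python
import numpy as np

def attention_sample_features(patch_feats, attention, n_samples, seed=0):
    """Estimate sum_i a_i * f_i by drawing patch indices i ~ a and
    averaging their features, so only n_samples patches are processed."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(attention), size=n_samples, p=attention)
    return patch_feats[idx].mean(axis=0)

a = np.array([0.7, 0.1, 0.1, 0.1])          # attention over 4 patches
feats = np.array([[1.0, 0.0],
                  [0.0, 1.0],
                  [0.0, 2.0],
                  [0.0, 3.0]])
exact = (a[:, None] * feats).sum(axis=0)     # full attention-weighted sum
est = attention_sample_features(feats, a, n_samples=20000)  # close to exact
```

In practice `patch_feats[i]` would be computed on demand by a feature network applied to the i-th high-resolution patch, which is where the memory savings come from.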

Bilinear CNN Models for Fine-Grained Visual Recognition

We propose bilinear models, a recognition architecture that consists of two feature extractors whose outputs are multiplied using the outer product at each location of the image and pooled to obtain an image descriptor.
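The pooled outer product described above reduces to a single matrix multiplication when the two feature maps are flattened over spatial locations. A minimal sketch, assuming the signed square-root and l2 normalization used in bilinear CNN pipelines (`bilinear_pool` is an illustrative name):

```python
import numpy as np

def bilinear_pool(feat_a, feat_b):
    """Bilinear pooling: outer product of two feature maps at each
    spatial location, summed over the N locations, then flattened.
    feat_a: (Ca, N), feat_b: (Cb, N)."""
    desc = feat_a @ feat_b.T                      # (Ca, Cb) pooled outer products
    desc = desc.flatten()
    desc = np.sign(desc) * np.sqrt(np.abs(desc))  # signed square-root
    return desc / (np.linalg.norm(desc) + 1e-12)  # l2 normalization

rng = np.random.default_rng(0)
fa = rng.standard_normal((8, 49))    # e.g. 8-channel conv features, 7x7 grid
fb = rng.standard_normal((16, 49))   # a second extractor's features
d = bilinear_pool(fa, fb)            # 8 * 16 = 128-dimensional descriptor
```

The descriptor dimension is the product of the two channel counts, which is why compact approximations are often used for large feature extractors.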

Picking Deep Filter Responses for Fine-Grained Image Recognition

This paper proposes an automatic fine-grained recognition approach that is free of any object/part annotation at both training and testing stages, and conditionally picks deep filter responses to encode into the final representation, taking into account the importance of the filter responses themselves.

Deep Residual Learning for Image Recognition

This work presents a residual learning framework to ease the training of networks that are substantially deeper than those used previously, and provides comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth.

Learning Non-maximum Suppression

A new network architecture designed to perform non-maximum suppression (NMS) using only boxes and their scores shows promise, providing improved localization and occlusion handling.

Saccader: Improving Accuracy of Hard Attention Models for Vision

Key to Saccader is a pretraining step that requires only class labels and provides initial attention locations for policy-gradient optimization, narrowing the gap to common ImageNet baselines.