Corpus ID: 233169023

Differentiable Patch Selection for Image Recognition

Authors: Jean-Baptiste Cordonnier, Aravindh Mahendran, A. Dosovitskiy, Dirk Weissenborn, Jakob Uszkoreit, Thomas Unterthiner
Neural networks require large amounts of memory and compute to process high-resolution images, even when only a small part of the image is actually informative for the task at hand. We propose a method based on a differentiable Top-K operator to select the most relevant parts of the input to efficiently process high-resolution images. Our method may be interfaced with any downstream neural network, is able to aggregate information from different patches in a flexible way, and allows the whole…
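To make the idea of a differentiable Top-K operator more concrete, here is a minimal sketch of one common relaxation: averaging hard top-k indicator matrices over randomly perturbed scores. This is an illustration under stated assumptions, not necessarily the authors' exact operator. The function names, the noise scale `sigma`, the sample count, and the example scores are all hypothetical, and only the forward pass is shown (no gradient computation).

```python
import random


def topk_indicators(scores, k):
    """Hard top-k: return a k x n one-hot matrix, rows ordered by patch index."""
    n = len(scores)
    top = sorted(range(n), key=lambda i: scores[i], reverse=True)[:k]
    top.sort()  # keep the selected patches in their original spatial order
    return [[1.0 if j == i else 0.0 for j in range(n)] for i in top]


def perturbed_topk(scores, k, num_samples=500, sigma=0.05, seed=0):
    """Monte Carlo average of hard top-k indicators under Gaussian score noise.

    Assumption: Gaussian noise with standard deviation `sigma`; larger values
    give softer (more spread-out) selection weights.
    """
    rng = random.Random(seed)
    n = len(scores)
    acc = [[0.0] * n for _ in range(k)]
    for _ in range(num_samples):
        noisy = [s + sigma * rng.gauss(0.0, 1.0) for s in scores]
        hard = topk_indicators(noisy, k)
        for r in range(k):
            for c in range(n):
                acc[r][c] += hard[r][c] / num_samples
    return acc


# Hypothetical relevance scores for five patches.
scores = [0.1, 0.9, 0.3, 0.8, 0.2]
soft = perturbed_topk(scores, k=2)
# Each row of `soft` is a distribution over patches; a soft "selected patch"
# would be the weighted sum of patch features under that row.
```

Each row of the result sums to one, so a downstream network can consume a weighted combination of patches, and the weights vary smoothly with the scores as `sigma` grows.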
MIST: Multiple Instance Spatial Transformer
We propose a deep network that can be trained to tackle image reconstruction and classification problems that involve detection of multiple object instances, without any supervision regarding their…
Grid Partitioned Attention: Efficient Transformer Approximation with Inductive Bias for High Resolution Detail Generation
This paper introduces the new attention layer, analyzes its complexity and how the trade-off between memory usage and model power can be tuned by the hyper-parameters, and shows how such attention enables novel deep learning architectures with copying modules that are especially useful for conditional image generation tasks like pose morphing.
Towards mental time travel: a hierarchical memory for reinforcement learning agents
It is shown that agents with HTM substantially outperform agents with other memory architectures at tasks requiring long-term recall, retention, or reasoning over memory, and is a step towards agents that can learn, interact, and adapt in complex and temporally-extended environments.
Dynamic Neural Networks: A Survey
This survey comprehensively reviews the rapidly developing area of dynamic networks by dividing them into three main categories: sample-wise dynamic models that process each sample with data-dependent architectures or parameters; spatial-wise dynamic networks that conduct adaptive computation with respect to different spatial locations of image data; and temporal-wise dynamic networks that perform adaptive inference along the temporal dimension for sequential data.


Recurrent Models of Visual Attention
A novel recurrent neural network model that is capable of extracting information from an image or video by adaptively selecting a sequence of regions or locations and only processing the selected regions at high resolution is presented.
Multiple Object Recognition with Visual Attention
The model is a deep recurrent neural network trained with reinforcement learning to attend to the most relevant regions of the input image and it is shown that the model learns to both localize and recognize multiple objects despite being given only class labels during training.
Spatial Transformer Networks
This work introduces a new learnable module, the Spatial Transformer, which explicitly allows the spatial manipulation of data within the network, and can be inserted into existing convolutional architectures, giving neural networks the ability to actively spatially transform feature maps.
Processing Megapixel Images with Deep Attention-Sampling Models
A fully differentiable, end-to-end trainable model that samples and processes only a fraction of the full-resolution input image and is evaluated on three classification tasks, where it reduces computation and memory footprint by an order of magnitude for the same accuracy as classical architectures.
MIST: Multiple Instance Spatial Transformer Network
This work proposes a deep network that can be trained to tackle image reconstruction and classification problems that involve detection of multiple object instances, without any supervision regarding their whereabouts, and is able to learn to detect recurrent structures in the training dataset by learning to reconstruct images.
Bilinear CNN Models for Fine-Grained Visual Recognition
We propose bilinear models, a recognition architecture that consists of two feature extractors whose outputs are multiplied using outer product at each location of the image and pooled to obtain an…
Picking Deep Filter Responses for Fine-Grained Image Recognition
This paper proposes an automatic fine-grained recognition approach that is free of any object or part annotation at both training and testing stages, and conditionally picks deep filter responses to encode them into the final representation, which considers the importance of the filter responses themselves.
Deep Residual Learning for Image Recognition
This work presents a residual learning framework to ease the training of networks that are substantially deeper than those used previously, and provides comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth.
Learning Non-maximum Suppression
A new network architecture designed to perform non-maximum suppression (NMS), using only boxes and their scores, shows promise, providing improved localization and occlusion handling.
Saccader: Improving Accuracy of Hard Attention Models for Vision
Key to Saccader is a pretraining step that requires only class labels and provides initial attention locations for policy gradient optimization, which narrows the gap to common ImageNet baselines.