Corpus ID: 17195923

Recurrent Models of Visual Attention

@article{Mnih2014RecurrentMO,
  title={Recurrent Models of Visual Attention},
  author={Volodymyr Mnih and Nicolas Manfred Otto Heess and Alex Graves and Koray Kavukcuoglu},
  journal={ArXiv},
  year={2014},
  volume={abs/1406.6247}
}
Applying convolutional neural networks to large images is computationally expensive because the amount of computation scales linearly with the number of image pixels. We present a novel recurrent neural network model that is capable of extracting information from an image or video by adaptively selecting a sequence of regions or locations and only processing the selected regions at high resolution. Like convolutional neural networks, the proposed model has a degree of translation invariance built-in, but the amount of computation it performs can be controlled independently of the input image size. While the model is non-differentiable, it can be trained using reinforcement learning methods to learn task-specific policies. We evaluate our model on several image classification tasks, where it significantly outperforms a convolutional neural network baseline on cluttered images, and on a dynamic visual control problem, where it learns to track a simple object without an explicit training signal for doing so.
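The mechanism the abstract describes (a sequence of small glimpses whose locations are sampled from a learned policy, trained with REINFORCE because the sampling step is non-differentiable) can be illustrated with a short sketch. The following is a minimal, hypothetical PyTorch illustration, not the authors' implementation: the class name TinyRAM, the single-resolution crop, the fixed location standard deviation, and the batch-mean baseline are simplifying assumptions, whereas the paper uses a multi-resolution retina-like glimpse sensor and a learned baseline.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TinyRAM(nn.Module):
        """Toy recurrent attention model: glimpse -> GRU core -> next location + class scores."""
        def __init__(self, patch=8, hidden=128, n_classes=10):
            super().__init__()
            self.patch = patch
            self.glimpse_fc = nn.Linear(patch * patch + 2, hidden)  # crop pixels + (y, x) location
            self.core = nn.GRUCell(hidden, hidden)                  # recurrent state over glimpses
            self.loc_head = nn.Linear(hidden, 2)                    # mean of the next location
            self.cls_head = nn.Linear(hidden, n_classes)            # class scores after the last glimpse

        def crop(self, img, loc):
            # img: (B, H, W); loc in [-1, 1]^2. Extract a patch x patch window at loc (assumes H, W >= patch).
            B, H, W = img.shape
            out = img.new_zeros(B, self.patch, self.patch)
            for b in range(B):
                y = int((loc[b, 0].item() + 1) / 2 * (H - self.patch))
                x = int((loc[b, 1].item() + 1) / 2 * (W - self.patch))
                out[b] = img[b, y:y + self.patch, x:x + self.patch]
            return out

        def forward(self, img, n_glimpses=6, loc_std=0.15):
            B = img.shape[0]
            h = img.new_zeros(B, self.core.hidden_size)
            loc = img.new_zeros(B, 2)
            log_probs = []
            for _ in range(n_glimpses):
                g = self.crop(img, loc).flatten(1)
                g = F.relu(self.glimpse_fc(torch.cat([g, loc], dim=1)))
                h = self.core(g, h)
                dist = torch.distributions.Normal(torch.tanh(self.loc_head(h)), loc_std)
                sample = dist.sample()                         # stochastic, non-differentiable choice
                log_probs.append(dist.log_prob(sample).sum(dim=1))
                loc = sample.clamp(-1.0, 1.0).detach()
            return self.cls_head(h), torch.stack(log_probs, dim=1)

    def loss_fn(logits, log_probs, labels):
        # Hybrid objective: cross-entropy for the classifier, REINFORCE for the location policy.
        reward = (logits.argmax(dim=1) == labels).float().unsqueeze(1)  # 1 if correct, else 0
        baseline = reward.mean()                                        # crude variance reduction
        reinforce = -((reward - baseline) * log_probs).sum(dim=1).mean()
        return F.cross_entropy(logits, labels) + reinforce

Note that the REINFORCE term only pushes gradients through the log-probabilities of the sampled locations, which is what lets the non-differentiable glimpse selection be trained at all; per-glimpse computation stays fixed regardless of the input image size.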

Citations

Comparison of Neuronal Attention Models
TLDR
The purpose of this paper is to explain and also test each of the NAM's parameters, and to show that it can efficiently choose several small regions from the initial image to focus on.
Capacity Visual Attention Networks
TLDR
An attention-based model that automatically learns to extract information from an image by adaptively assigning its capacity across different portions of the input data and only processing the selected regions of different sizes at high resolution is introduced.
Recurrent Mixture Density Network for Spatiotemporal Visual Attention
TLDR
A spatiotemporal attentional model that learns where to look in a video directly from human fixation data, and is optimized via maximum likelihood estimation using human fixations as training data, without knowledge of the action in each video.
Spatially Adaptive Computation Time for Residual Networks
TLDR
Experimental results are presented showing that this model improves the computational efficiency of Residual Networks on the challenging ImageNet classification and COCO object detection datasets, and that its computation time maps on the visual saliency dataset CAT2000 correlate surprisingly well with human eye fixation positions.
Elman and Jordan Recurrence in Convolutional Neural Networks Using Attention Window
TLDR
Five variations of Elman and Jordan recurrence in convolutional neural networks (EJRCNNs) are proposed; each of the five networks takes as input a series of small attention windows cropped from different locations in the image.
Gaussian RAM: Lightweight Image Classification via Stochastic Retina-Inspired Glimpse and Reinforcement Learning
  • Dongseok Shim, H. Kim
  • Computer Science
    2020 20th International Conference on Control, Automation and Systems (ICCAS)
  • 2020
TLDR
A Gaussian Deep Recurrent visual Attention Model (GDRAM) is proposed: a reinforcement-learning-based lightweight deep neural network for large-scale image classification that outperforms a conventional CNN (convolutional neural network) using the entire image as input.
Glance and Focus: a Dynamic Approach to Reducing Spatial Redundancy in Image Classification
TLDR
This work proposes a novel framework that performs efficient image classification by processing a sequence of relatively small inputs strategically selected from the original image with reinforcement learning, consistently improving the computational efficiency of a wide variety of deep models.
Glance and Focus Networks for Dynamic Visual Recognition
TLDR
The proposed Glance and Focus Network (GFNet) first extracts a quick global representation of the input image at a low resolution scale, and then strategically attends to a series of salient regions to learn finer features, mimicking the human visual system.
Look and Think Twice: Capturing Top-Down Visual Attention with Feedback Convolutional Neural Networks
TLDR
The background of feedback in the human visual cortex is introduced, which motivates the development of a computational feedback mechanism in deep neural networks, and a feedback loop is added to infer the activation status of hidden-layer neurons according to the "goal" of the network.
RATM: Recurrent Attentive Tracking Model
TLDR
The proposed RATM performs well on all three tasks and can generalize to related but previously unseen sequences from a challenging tracking data set.

References

Showing 1-10 of 30 references
ImageNet classification with deep convolutional neural networks
TLDR
A large, deep convolutional neural network was trained to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes and employed a recently developed regularization method called "dropout" that proved to be very effective.
OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks
TLDR
This integrated framework for using Convolutional Networks for classification, localization and detection is the winner of the localization task of the ImageNet Large Scale Visual Recognition Challenge 2013 and obtained very competitive results for the detection and classifications tasks.
Learning Where to Attend with Deep Architectures for Image Tracking
TLDR
An attentional model for simultaneous object tracking and recognition driven by gaze data is discussed; a straightforward extension of the existing approach to the partial-information setting results in poor performance, so an alternative method based on modeling the reward surface as a Gaussian process is proposed.
Action from Still Image Dataset and Inverse Optimal Control to Learn Task Specific Visual Scanpaths
TLDR
This work complements one of the largest and most challenging static computer vision datasets, VOC 2012 Actions, with human eye movement recordings collected under the primary task constraint of action recognition, as well as for context recognition, in order to analyze the impact of different tasks.
Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation
TLDR
This paper proposes a simple and scalable detection algorithm that improves mean average precision (mAP) by more than 30% relative to the previous best result on VOC 2012 -- achieving a mAP of 53.3%.
Rapid object detection using a boosted cascade of simple features
  • Paul A. Viola, Michael J. Jones
  • Computer Science
    Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. CVPR 2001
  • 2001
TLDR
A machine learning approach for visual object detection is presented that is capable of processing images extremely rapidly while achieving high detection rates, together with a new image representation called the "integral image" that allows the features used by the detector to be computed very quickly.
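The "integral image" this summary refers to is a cumulative-sum table that turns any rectangular pixel sum into a constant number of lookups. A minimal NumPy illustration (not code from the paper; function names are ours):

    import numpy as np

    def integral_image(img):
        # S[y, x] holds the sum of img[:y+1, :x+1]; built with two cumulative sums.
        return img.cumsum(axis=0).cumsum(axis=1)

    def box_sum(S, y0, x0, y1, x1):
        # Sum of img[y0:y1+1, x0:x1+1] recovered from at most four lookups into S.
        total = S[y1, x1]
        if y0 > 0:
            total -= S[y0 - 1, x1]
        if x0 > 0:
            total -= S[y1, x0 - 1]
        if y0 > 0 and x0 > 0:
            total += S[y0 - 1, x0 - 1]
        return total

    img = np.arange(16, dtype=float).reshape(4, 4)
    S = integral_image(img)
    assert box_sum(S, 1, 1, 2, 2) == img[1:3, 1:3].sum()  # constant-time rectangle sum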
Q-learning of sequential attention for visual object recognition from informative local descriptors
TLDR
This work provides a framework for learning sequential attention in real-world visual object recognition, using an architecture of three processing stages that integrates local information via shifts of attention, resulting in chains of descriptor-action pairs that characterize object discrimination.
A Model of Saliency-Based Visual Attention for Rapid Scene Analysis
TLDR
A visual attention system, inspired by the behavior and the neuronal architecture of the early primate visual system, is presented, which breaks down the complex problem of scene understanding by rapidly selecting conspicuous locations to be analyzed in detail.
Contextual guidance of eye movements and attention in real-world scenes: the role of global features in object search.
TLDR
An original approach to attentional guidance by global scene context is presented that combines bottom-up saliency, scene context, and top-down mechanisms at an early stage of visual processing and predicts the image regions likely to be fixated by human observers performing natural search tasks in real-world scenes.
Simple statistical gradient-following algorithms for connectionist reinforcement learning
TLDR
This article presents a general class of associative reinforcement learning algorithms for connectionist networks containing stochastic units that are shown to make weight adjustments in a direction that lies along the gradient of expected reinforcement in both immediate-reinforcement tasks and certain limited forms of delayed-reinforcement tasks, and they do this without explicitly computing gradient estimates.
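As a reminder of what "gradient of expected reinforcement" means here, the standard REINFORCE estimator (written in our notation, with a baseline b included for variance reduction; this is a generic statement of the rule, not a quotation from the article) is

    \nabla_\theta \, \mathbb{E}_{a \sim \pi_\theta}\!\left[ R \right]
      \;=\; \mathbb{E}_{a \sim \pi_\theta}\!\left[ (R - b)\, \nabla_\theta \log \pi_\theta(a) \right]

so a stochastic unit can follow this gradient using only samples of its own actions a and the received reward R, without differentiating through the environment.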