Look and Think Twice: Capturing Top-Down Visual Attention with Feedback Convolutional Neural Networks

@article{Cao2015LookAT,
  title={Look and Think Twice: Capturing Top-Down Visual Attention with Feedback Convolutional Neural Networks},
  author={Chunshui Cao and Xianming Liu and Yi Yang and Yinan Yu and Jiang Wang and Zilei Wang and Yongzhen Huang and Liang Wang and Chang Huang and Wei Xu and Deva Ramanan and Thomas S. Huang},
  journal={2015 IEEE International Conference on Computer Vision (ICCV)},
  year={2015},
  pages={2956-2964}
}
While feedforward deep convolutional neural networks (CNNs) have been a great success in computer vision, it is important to note that the human visual cortex generally contains more feedback than feedforward connections. In this paper, we will briefly introduce the background of feedbacks in the human visual cortex, which motivates us to develop a computational feedback mechanism in deep neural networks. In addition to the feedforward inference in traditional neural networks, a feedback loop… 
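The paper's feedback mechanism optimizes gates on hidden neurons so that only neurons relevant to a target class stay active. As a minimal sketch (not the authors' exact formulation), relaxed gates in [0, 1] on a tiny two-layer ReLU net can be updated by gradient ascent on the target class score; the network and the gradient-ascent schedule here are illustrative assumptions:

```python
import numpy as np

def feedback_pass(x, W1, W2, target, n_iters=10, lr=0.5):
    """Sketch of a top-down feedback loop on a tiny two-layer ReLU net.

    Gates z in [0, 1] on the hidden layer are optimized to maximize the
    target class score, suppressing neurons irrelevant to the target
    (a simplified continuous relaxation of binary feedback gates).
    """
    h = np.maximum(W1 @ x, 0.0)           # feedforward hidden activations
    z = np.ones_like(h)                   # gates start fully open
    for _ in range(n_iters):
        grad = W2[target] * h             # d(score)/dz for score = W2[target] @ (z*h)
        z = np.clip(z + lr * grad, 0.0, 1.0)
    return z, W2 @ (z * h)                # final gates and re-scored logits
```

With enough iterations the gates converge to 1 for hidden units that excite the target class and 0 for units that inhibit it, which is the "think twice" behavior the abstract describes.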
Global Perception Feedback Convolutional Neural Networks
TLDR
A global perception feedback convolutional neural network that considers the global structure of visual response during feedback inference and eliminates “Visual illusions” that are produced in the process of visual attention.
Feedback Convolutional Neural Network for Visual Localization and Segmentation
TLDR
It is claimed that feedback plays a critical role in understanding convolutional neural networks (CNNs), e.g., how a neuron in a CNN describes an object's pattern, and how a collection of neurons forms a comprehensive perception of an object.
TDAN: Top-Down Attention Networks for Enhanced Feature Selectivity in CNNs
TLDR
A lightweight top-down (TD) attention module that iteratively generates a "visual searchlight" to perform top-down channel and spatial modulation of its inputs and consequently outputs more selective feature activations at each computation step is proposed.
Lateral Inhibition-Inspired Convolutional Neural Network for Visual Attention and Saliency Detection
TLDR
This paper proposes to formulate lateral inhibition, inspired by related studies in neurobiology, and embed it into the top-down gradient computation of a general CNN for classification, i.e., only category-level information is used.
Human vs Machine Attention in Neural Networks: A Comparative Study
TLDR
The overall results demonstrate that human attention can serve as a meaningful "ground truth" in attention-driven tasks, where the closer the artificial attention is to human attention, the better the performance; for higher-level vision tasks, it is case-by-case.
Deep Networks for Human Visual Attention: A Hybrid Model Using Foveal Vision
TLDR
This work proposes a biologically inspired object classification and localization framework that combines deep convolutional neural networks with foveal vision, and demonstrates that one does not need to store and transmit all the information present in high-resolution images since, beyond a certain amount of preserved information, performance on the classification and localization tasks saturates.
Top-Down Neural Attention by Excitation Backprop
TLDR
A new backpropagation scheme, called Excitation Backprop, is proposed to pass top-down signals downwards in the network hierarchy via a probabilistic Winner-Take-All process, and the concept of contrastive attention is introduced to make the top-down attention maps more discriminative.
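The probabilistic Winner-Take-All idea can be sketched for a single linear layer: each input unit's probability of "winning" is the sum over output units of the output's winner probability times the input's share of that output's excitatory (positive-weight) drive. This is a minimal illustration, not the full multi-layer scheme:

```python
import numpy as np

def excitation_backprop_step(p_out, a_in, W):
    """One Excitation Backprop step through a linear layer.

    p_out: top-down winner probabilities over output neurons (sums to 1)
    a_in:  non-negative input activations (e.g. post-ReLU)
    W:     weight matrix of shape (n_out, n_in)
    """
    W_pos = np.maximum(W, 0.0)                  # only excitatory connections compete
    contrib = W_pos * a_in                      # (n_out, n_in): each input's excitation
    norm = contrib.sum(axis=1, keepdims=True)
    norm[norm == 0.0] = 1.0                     # avoid division by zero for dead parents
    cond = contrib / norm                       # P(input wins | output wins)
    return cond.T @ p_out                       # marginal winner probability per input
```

Because each row of `cond` is a proper conditional distribution, the returned probabilities again sum to one, so repeating the step layer by layer yields a normalized attention map over the input.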
Learning what and where to attend
TLDR
A state-of-the-art attention network is extended and it is demonstrated that adding ClickMe supervision significantly improves its accuracy and yields visual features that are more interpretable and more similar to those used by human observers.
Learning what and where to attend
Most recent gains in visual recognition have originated from the inclusion of attention mechanisms in deep convolutional networks (DCNs). Because these networks are optimized for object recognition,
Recurrent Mixture Density Network for Spatiotemporal Visual Attention
TLDR
A spatiotemporal attentional model that learns where to look in a video directly from human fixation data, and is optimized via maximum likelihood estimation using human fixations as training data, without knowledge of the action in each video.

References

Showing 1-10 of 40 references
Recurrent Models of Visual Attention
TLDR
A novel recurrent neural network model that is capable of extracting information from an image or video by adaptively selecting a sequence of regions or locations and only processing the selected regions at high resolution is presented.
Deep Networks with Internal Selective Attention through Feedback Connections
TLDR
DasNet harnesses the power of sequential processing to improve classification performance, by allowing the network to iteratively focus its internal attention on some of its convolutional filters.
Network In Network
TLDR
With enhanced local modeling via the micro network, the proposed deep network structure NIN is able to utilize global average pooling over feature maps in the classification layer, which is easier to interpret and less prone to overfitting than traditional fully connected layers.
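Global average pooling, as used in the NIN classification layer, collapses each category's feature map to a single confidence value with no learned parameters; a minimal sketch:

```python
import numpy as np

def global_average_pool(feature_maps):
    """Collapse each feature map to one confidence value.

    feature_maps: array of shape (n_classes, H, W), one map per category,
    as in the Network In Network classification layer.
    """
    return feature_maps.mean(axis=(1, 2))       # shape (n_classes,)
```

The pooled vector is fed directly to a softmax, so each feature map is forced to act as a spatial confidence map for its category, which is what makes the layer easy to interpret and resistant to overfitting compared with fully connected classifiers.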
Going deeper with convolutions
We propose a deep convolutional neural network architecture codenamed Inception that achieves the new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition
Very Deep Convolutional Networks for Large-Scale Image Recognition
TLDR
This work investigates the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting using an architecture with very small convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers.
DRAW: A Recurrent Neural Network For Image Generation
TLDR
The Deep Recurrent Attentive Writer neural network architecture for image generation substantially improves on the state of the art for generative models on MNIST, and, when trained on the Street View House Numbers dataset, it generates images that cannot be distinguished from real data with the naked eye.
Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps
TLDR
This paper addresses the visualisation of image classification models, learnt using deep Convolutional Networks (ConvNets), and establishes the connection between the gradient-based ConvNet visualisation methods and deconvolutional networks.
Hierarchical Bayesian inference in the visual cortex.
  • T. Lee and D. Mumford, Journal of the Optical Society of America A: Optics, Image Science, and Vision, 2003
TLDR
This work proposes a new theoretical setting based on the mathematical framework of hierarchical Bayesian inference for reasoning about the visual system, and suggests that the algorithms of particle filtering and Bayesian-belief propagation might model these interactive cortical computations.
Scalable Object Detection Using Deep Neural Networks
TLDR
This work proposes a saliency-inspired neural network model for detection, which predicts a set of class-agnostic bounding boxes along with a single score for each box, corresponding to its likelihood of containing any object of interest.
Attentional Neural Network: Feature Selection Using Cognitive Feedback
TLDR
Attentional Neural Network is a new framework that integrates top-down cognitive bias and bottom-up feature extraction in one coherent architecture; it obtains classification accuracy better than or competitive with state-of-the-art results on the MNIST variation dataset, and successfully disentangles overlaid digits with high success rates.