Near-Optimal Glimpse Sequences for Improved Hard Attention Neural Network Training

@article{Harvey2022NearOptimal,
  title={Near-Optimal Glimpse Sequences for Improved Hard Attention Neural Network Training},
  author={William Harvey and Michael Teng and Frank Wood},
  journal={2022 International Joint Conference on Neural Networks (IJCNN)},
  year={2022}
}
Hard visual attention is a promising approach to reduce the computational burden of modern computer vision methodologies. However, hard attention mechanisms can be difficult and slow to train, which is especially costly for applications like neural architecture search where multiple networks must be trained. We introduce a method to amortise the cost of training by generating an extra supervision signal for a subset of the training data. This supervision is in the form of sequences of ‘good… 


A Probabilistic Hard Attention Model For Sequentially Observed Scenes

This paper designs an efficient hard attention model for classifying such sequentially observed scenes and uses normalizing flows in Partial VAE to handle multi-modality in the feature-synthesis problem.

Image Completion via Inference in Deep Generative Models

An application requiring an in-painting model with the capabilities the authors' model exhibits is described and demonstrated: the use of Bayesian optimal experimental design to select the most informative sequence of small-field-of-view x-rays for chest pathology detection.

Conditional Image Generation by Conditioning Variational Auto-Encoders

We present a conditional variational auto-encoder (VAE) which, to avoid the substantial cost of training from scratch, uses an architecture and training objective capable of leveraging a foundation

Saccader: Improving Accuracy of Hard Attention Models for Vision

Key to Saccader is a pretraining step that requires only class labels and provides initial attention locations for policy gradient optimization, which narrows the gap to common ImageNet baselines.

Learning Hard Alignments with Variational Inference

This paper tackles the problem of learning hard attention for a sequential task using variational inference methods, specifically the recently introduced Variational Inference for Monte Carlo Objectives (VIMCO), and proposes a novel baseline that adapts VIMCO to this setting.

Neural Architecture Search with Reinforcement Learning

This paper uses a recurrent network to generate the model descriptions of neural networks and trains this RNN with reinforcement learning to maximize the expected accuracy of the generated architectures on a validation set.

Context Encoders: Feature Learning by Inpainting

It is found that a context encoder learns a representation that captures not just appearance but also the semantics of visual structures, and can be used for semantic inpainting tasks, either stand-alone or as initialization for non-parametric methods.

Attention is All you Need

A new simple network architecture, the Transformer, based solely on attention mechanisms and dispensing with recurrence and convolutions entirely, is proposed; it generalizes well to other tasks, as shown by its successful application to English constituency parsing with both large and limited training data.

Recurrent Models of Visual Attention

A novel recurrent neural network model that is capable of extracting information from an image or video by adaptively selecting a sequence of regions or locations and only processing the selected regions at high resolution is presented.

Processing Megapixel Images with Deep Attention-Sampling Models

A fully differentiable end-to-end trainable model that samples and processes only a fraction of the full-resolution input image is evaluated on three classification tasks, where it reduces computation and memory footprint by an order of magnitude at the same accuracy as classical architectures.

Supervising Neural Attention Models for Video Captioning by Human Gaze Data

This paper proposes a video captioning model named Gaze Encoding Attention Network (GEAN) that can leverage gaze tracking information to provide the spatial and temporal attention for sentence generation and demonstrates that spatial attentions guided by human gaze data indeed improve the performance of multiple captioning methods.

Exploring Human-like Attention Supervision in Visual Question Answering

The experiments show that adding human-like attention supervision to an attention-based VQA model yields a more accurate attention together with a better performance, showing a promising future for human-like attention supervision in VQA.

Very Deep Convolutional Networks for Large-Scale Image Recognition

This work investigates the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting using an architecture with very small convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers.