Learning Gaze Transitions from Depth to Improve Video Saliency Estimation

@article{Leifman2017LearningGT,
  title={Learning Gaze Transitions from Depth to Improve Video Saliency Estimation},
  author={George Leifman and Dmitry Rudoy and Tristan Swedish and Eduardo Bayro-Corrochano and Ramesh Raskar},
  journal={2017 IEEE International Conference on Computer Vision (ICCV)},
  year={2017},
  pages={1707-1716}
}
  • G. Leifman, D. Rudoy, T. Swedish, E. Bayro-Corrochano, R. Raskar
  • Published 11 March 2016
  • Computer Science
  • 2017 IEEE International Conference on Computer Vision (ICCV)
In this paper we introduce a novel Depth-Aware Video Saliency approach to predict human focus of attention when viewing videos that contain a depth map (RGBD) on a 2D screen. Saliency estimation in this scenario is highly important since in the near future 3D video content will be easily acquired yet hard to display. Despite considerable progress in 3D display technologies, most are still expensive and require special glasses for viewing, so RGBD content is primarily viewed on 2D screens… 
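As a rough illustration of the depth-aware idea (not the authors' learned gaze-transition model), a 2D saliency map can be fused with a depth-derived prior. The nearness assumption, the Gaussian form, and the blending weight below are all illustrative choices.

import numpy as np

def depth_prior(depth, sigma=0.25):
    # Assumes smaller depth values are nearer; maps depth to a soft
    # "nearness" prior so closer regions receive more weight.
    d = (depth - depth.min()) / (np.ptp(depth) + 1e-8)
    return np.exp(-d ** 2 / (2 * sigma ** 2))

def fuse_saliency(rgb_saliency, depth, alpha=0.7):
    # Convex combination of appearance saliency and the depth prior.
    fused = alpha * rgb_saliency + (1 - alpha) * depth_prior(depth)
    return fused / (fused.max() + 1e-8)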

Citations

Revisiting Video Saliency Prediction in the Deep Learning Era
TLDR
A new benchmark, called DHF1K (Dynamic Human Fixation 1K), is introduced for predicting fixations during dynamic scene free-viewing, and a novel video saliency model is proposed, called ACLNet (Attentive CNN-LSTM Network), that augments the CNN-LSTM architecture with a supervised attention mechanism to enable fast end-to-end saliency learning.
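A minimal PyTorch sketch of the attentive CNN-LSTM pattern this TLDR describes; the layer widths, the attention form, and the single-layer ConvLSTM are assumptions for illustration, not the published ACLNet.

import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """Convolutional LSTM cell: one conv produces all four gates."""
    def __init__(self, ch):
        super().__init__()
        self.gates = nn.Conv2d(2 * ch, 4 * ch, kernel_size=3, padding=1)

    def forward(self, x, state):
        h, c = state
        i, f, g, o = self.gates(torch.cat([x, h], dim=1)).chunk(4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c

class AttentiveCNNLSTM(nn.Module):
    def __init__(self, ch=32):
        super().__init__()
        self.cnn = nn.Sequential(nn.Conv2d(3, ch, 3, padding=1), nn.ReLU())
        self.attn = nn.Conv2d(ch, 1, 1)  # attention head; supervised against static fixation maps during training
        self.cell = ConvLSTMCell(ch)
        self.head = nn.Conv2d(ch, 1, 1)  # per-frame saliency head
        self.ch = ch

    def forward(self, frames):           # frames: (T, 3, H, W)
        T, _, H, W = frames.shape
        h = frames.new_zeros(1, self.ch, H, W)
        c = frames.new_zeros(1, self.ch, H, W)
        sal = []
        for t in range(T):
            feat = self.cnn(frames[t:t + 1])
            a = torch.sigmoid(self.attn(feat))   # spatial attention map
            h, c = self.cell(feat * a, (h, c))   # recurrence over attended features
            sal.append(torch.sigmoid(self.head(h)))
        return torch.cat(sal)                    # (T, 1, H, W) saliency maps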
Video Saliency Prediction via Deep Eye Movement Learning
TLDR
Compared with previous methods that use human fixations as ground truth, the proposed EML model uses the optical flow of fixations between successive frames as an extra ground truth for the purpose of eye movement learning.
Going from Image to Video Saliency: Augmenting Image Salience with Dynamic Attentional Push
TLDR
A novel method to incorporate recent advances in static saliency models into video saliency prediction by proposing a multi-stream Convolutional Long Short-Term Memory (ConvLSTM) structure which augments state-of-the-art static saliency models with dynamic Attentional Push.
Saliency Prediction in the Deep Learning Era: An Empirical Investigation
TLDR
A large number of image and video saliency models are reviewed and compared over two image benchmarks and two large scale video datasets and factors that contribute to the gap between models and humans are identified.
Saliency Prediction in the Deep Learning Era: Successes and Limitations
  • A. Borji
  • Computer Science
    IEEE Transactions on Pattern Analysis and Machine Intelligence
  • 2021
TLDR
A large number of image and video saliency models are reviewed and compared over two image benchmarks and two large scale video datasets and factors that contribute to the gap between models and humans are identified.
RT-BENE: A Dataset and Baselines for Real-Time Blink Estimation in Natural Environments
TLDR
It is argued that this work will benefit both gaze estimation and blink estimation methods, and the proposed RT-BENE baselines in the recently presented RT-GENE gaze estimation framework provide real-time inference of eye openness.
Saliency detection in deep learning era: trends of development
TLDR
A detailed survey of saliency detection methods in the deep learning era that clarifies the current capabilities of CNN-based approaches for visual analysis based on human eye tracking and digital image processing.
Saliency Prediction Network for 360° Videos
TLDR
A saliency prediction network for panoramic videos that incorporates the global feature of video frames, residual attention, and Gaussian priors into the network by considering the viewing behavior of 360° videos, which is useful for performance improvement.
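The Gaussian-prior ingredient can be sketched directly: in equirectangular frames viewers concentrate near the equator and rarely look at the poles, so a latitude Gaussian is a natural prior. The sigma value below is an illustrative assumption.

import numpy as np

def equator_prior(height, width, sigma_deg=20.0):
    lat = np.linspace(90.0, -90.0, height)          # latitude of each pixel row
    row = np.exp(-lat ** 2 / (2 * sigma_deg ** 2))  # Gaussian in latitude
    return np.tile(row[:, None], (1, width))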
Saliency Prediction in the Deep Learning Era: Successes, Limitations, and Future Challenges
TLDR
A large number of image and video saliency models are reviewed and compared over two image benchmarks and two large scale video datasets and factors that contribute to the gap between models and humans are identified.
STAViS: Spatio-Temporal AudioVisual Saliency Network
TLDR
Evaluation results across databases indicate that the STAViS model outperforms the authors' visual-only variant as well as the other state-of-the-art models in the majority of cases, indicating that it is appropriate for estimating saliency "in the wild".
…

References

Showing 1-10 of 75 references
Learning Video Saliency from Human Gaze Using Candidate Selection
TLDR
A novel method for video saliency estimation, inspired by the way people watch videos, is proposed; it explicitly models the continuity of the video by predicting the saliency map of a given frame conditioned on the map from the previous frame.
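The frame-to-frame conditioning idea can be sketched as a simple recursion in which each frame's map blends its own static saliency with the previous prediction; the blending weight is an assumption, and the paper's candidate-selection machinery is not reproduced here.

import numpy as np

def propagate(static_maps, alpha=0.6):
    # static_maps: list of per-frame 2D saliency arrays.
    prev = static_maps[0]
    out = [prev]
    for s in static_maps[1:]:
        prev = alpha * s + (1 - alpha) * prev   # condition on the previous map
        prev = prev / (prev.max() + 1e-8)
        out.append(prev)
    return out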
Depth Matters: Influence of Depth Cues on Visual Saliency
TLDR
This work collects a large human eye fixation database compiled from a pool of 600 2D-vs-3D image pairs viewed by 80 subjects, where the depth information is directly provided by the Kinect camera and the eye tracking data are captured in both 2D and 3D free-viewing experiments.
Depth really Matters: Improving Visual Salient Region Detection with Depth
TLDR
A 3D-saliency formulation is proposed that takes into account structural features of objects in an indoor setting to identify regions at salient depth levels, integrating depth and geometric features of object surfaces in indoor scenes.
Video Saliency Detection via Dynamic Consistent Spatio-Temporal Attention Modelling
TLDR
Empirical validations demonstrate that the salient regions detected by the dynamically consistent saliency map highlight the interesting objects effectively and efficiently, and are consistent with the ground-truth saliency maps from eye-movement data.
Predicting visual fixations on video based on low-level visual features
RGBD Salient Object Detection: A Benchmark and Algorithms
TLDR
A simple fusion framework that combines existing RGB-produced saliency with new depth-induced saliency and a specialized multi-stage RGBD model is proposed which takes account of both depth and appearance cues derived from low-level feature contrast, mid-level region grouping and high-level priors enhancement.
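A hedged sketch of a depth-induced saliency cue in the spirit of the low-level contrast term mentioned above (not the benchmark's algorithm): each pixel's saliency is its average depth contrast against the rest of the image, computed via a histogram for efficiency.

import numpy as np

def depth_contrast_saliency(depth, bins=64):
    d = (depth - depth.min()) / (np.ptp(depth) + 1e-8)
    # Per-bin mean absolute contrast: O(N + bins^2) instead of O(N^2).
    hist, edges = np.histogram(d, bins=bins, range=(0.0, 1.0))
    centers = (edges[:-1] + edges[1:]) / 2
    contrast = np.abs(centers[:, None] - centers[None, :]) @ hist
    idx = np.clip((d * bins).astype(int), 0, bins - 1)
    sal = contrast[idx]
    return sal / (sal.max() + 1e-8)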
Leveraging stereopsis for saliency analysis
TLDR
This paper explores stereopsis for saliency analysis and presents two approaches to stereo saliency detection from stereoscopic images, one based on the global disparity contrast in the input image and one that leverages domain knowledge in stereoscopic photography.
Automatic foveation for video compression using a neurobiological model of visual attention
  • L. Itti
  • Computer Science
    IEEE Transactions on Image Processing
  • 2004
TLDR
The general-purpose usefulness of the algorithm, based on a nonlinear integration of low-level visual cues mimicking processing in primate occipital and posterior parietal cortex, is suggested by improved compression ratios on unconstrained video.
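A toy version of saliency-driven foveation: blur grows with distance from the most salient point, so a codec spends fewer bits on the periphery. The two-level blend and the blur width are illustrative assumptions, not Itti's neurobiological model.

import numpy as np
from scipy.ndimage import gaussian_filter

def foveate(frame, saliency, max_sigma=8.0):
    # frame, saliency: 2D float arrays of equal shape.
    cy, cx = np.unravel_index(np.argmax(saliency), saliency.shape)
    yy, xx = np.indices(saliency.shape)
    dist = np.hypot(yy - cy, xx - cx)
    dist /= dist.max() + 1e-8
    # Two-level approximation: blend sharp and blurred copies by distance.
    blurred = gaussian_filter(frame, sigma=max_sigma)
    return (1 - dist) * frame + dist * blurred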
Depth Transfer: Depth Extraction from Video Using Non-Parametric Sampling
TLDR
The technique can be used to automatically convert a monoscopic video into stereo for 3D visualization, and is demonstrated through a variety of visually pleasing results for indoor and outdoor scenes, including results from the feature film Charade.
Unsupervised Video Analysis Based on a Spatiotemporal Saliency Detector
TLDR
This paper proposes a new approach for detecting spatiotemporal visual saliency based on the phase spectrum of videos, which is easy to implement and computationally efficient; the approach is evaluated on several commonly used datasets.
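For a single frame, the phase-spectrum idea reduces to a few FFT lines (the detector above extends it to spatiotemporal volumes); the smoothing width is an assumption.

import numpy as np
from scipy.ndimage import gaussian_filter

def phase_saliency(gray):
    f = np.fft.fft2(gray)
    phase_only = np.exp(1j * np.angle(f))       # keep phase, drop magnitude
    recon = np.abs(np.fft.ifft2(phase_only)) ** 2
    sal = gaussian_filter(recon, sigma=3)       # smooth the squared response
    return sal / (sal.max() + 1e-8)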