TASED-Net: Temporally-Aggregating Spatial Encoder-Decoder Network for Video Saliency Detection

  • Kyle Min and Jason J. Corso
  • Published 15 August 2019
  • Computer Science
  • 2019 IEEE/CVF International Conference on Computer Vision (ICCV)
TASED-Net is a 3D fully-convolutional network architecture for video saliency detection. It consists of two building blocks: first, the encoder network extracts low-resolution spatiotemporal features from an input clip of several consecutive frames, and then the following prediction network decodes the encoded features spatially while aggregating all the temporal information. As a result, a single prediction map is produced from an input clip of multiple frames. Frame-wise saliency maps can be… 
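The clip-to-one-map pipeline described above can be sketched at shape level. This is a toy illustration only: plain NumPy pooling stands in for the actual 3D-convolutional encoder (an S3D backbone) and prediction network, and the function names, strides, and pooling choices are assumptions, not TASED-Net's real layers.

```python
import numpy as np

def encode(clip, spatial_stride=4):
    """Toy encoder: spatially downsample each frame of the clip.

    clip: (T, H, W) grayscale frames. The real encoder uses 3D
    convolutions; here we simply average-pool spatially to obtain
    low-resolution spatiotemporal features.
    """
    T, H, W = clip.shape
    s = spatial_stride
    return clip.reshape(T, H // s, s, W // s, s).mean(axis=(2, 4))

def decode_and_aggregate(feats, spatial_stride=4):
    """Toy prediction network: collapse the temporal axis (max over
    frames, standing in for learned temporal aggregation) and
    upsample spatially back to the input resolution."""
    agg = feats.max(axis=0)                      # (h, w): all frames fused
    up = np.repeat(np.repeat(agg, spatial_stride, axis=0),
                   spatial_stride, axis=1)       # nearest-neighbour upsample
    e = np.exp(up - up.max())
    return e / e.sum()                           # normalise to a saliency map

clip = np.random.rand(32, 64, 64)                # 32 consecutive frames
saliency = decode_and_aggregate(encode(clip))
print(saliency.shape)                            # (64, 64): one map per clip
```

Sliding this window over a video then yields the frame-wise saliency maps mentioned above.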

Figures and Tables from this paper

TSMSAN: A Three-Stream Multi-Scale Attentive Network for Video Saliency Detection

The proposed three-stream multi-scale attentive network (TSMSAN) for saliency detection in dynamic scenes integrates motion-vector representations, a static saliency map, and RGB information at multiple scales into one framework, built on a Fully Convolutional Network (FCN) and a spatial attention mechanism.

A Spatial-Temporal Recurrent Neural Network for Video Saliency Prediction

An attention-aware convolutional long short-term memory (ConvLSTM) network is developed to predict salient regions from features extracted over consecutive frames and to generate the final saliency map for each video frame.

A Novel Spatio-Temporal 3D Convolutional Encoder-Decoder Network for Dynamic Saliency Prediction

A novel 3D convolutional encoder-decoder architecture is proposed for saliency prediction on dynamic scenes that achieves the state-of-the-art performance on key metrics including both normalized scanpath saliency and Pearson's correlation coefficient.

Temporal-Spatial Feature Pyramid for Video Saliency Detection

A 3D fully convolutional encoder-decoder architecture for video saliency detection that combines scale, space, and time information for video saliency modeling; the results indicate that the well-designed structure significantly improves the precision of video saliency detection.

Spatio-Temporal Self-Attention Network for Video Saliency Prediction

A novel Spatio-Temporal Self-Attention 3D Network (STSANet) is proposed for video saliency prediction, in which multiple spatio-temporal self-attention modules are employed at different levels of a 3D convolutional backbone to directly capture long-range relations between spatio-temporal features of different time steps.

Video Saliency Detection with Domain Adaptation using Hierarchical Gradient Reversal Layers

This work proposes a 3D fully convolutional architecture for video saliency detection that employs multi-head supervision on intermediate maps generated from features extracted at different abstraction levels, and outperforms state-of-the-art methods on most metrics for supervised saliency prediction.

Video Saliency Forecasting Transformer

This paper proposes a video saliency forecasting transformer (VSFT) network built on a new encoder-decoder architecture; it is the first pure-transformer architecture in the VSP field and removes the dependency on the pretrained S3D model.

Learning to Explore Saliency for Stereoscopic Videos Via Component-Based Interaction

A saliency prediction model for stereoscopic videos is devised that learns to explore saliency via component-based interactions, including spatial, temporal, and depth cues, and achieves superior performance compared to state-of-the-art methods.

A Gated Fusion Network for Dynamic Saliency Prediction

This article introduces GFSalNet, the first deep saliency model capable of making predictions in a dynamic way via a gated fusion mechanism; it shows good generalization ability and exploits temporal information more effectively through its adaptive fusion scheme.


Predicting Video Saliency with Object-to-Motion CNN and Two-layer Convolutional LSTM

A novel DNN-based video saliency prediction method that establishes a large-scale eye-tracking database of videos (LEDOV) and develops a two-layer convolutional long short-term memory (2C-LSTM) network, which takes the features extracted by OM-CNN as its input.

Spatio-Temporal Saliency Networks for Dynamic Saliency Prediction

This work proposes the use of deep learning for dynamic saliency prediction through so-called spatio-temporal saliency networks, investigating two-stream architectures to integrate spatial and temporal information.
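As an illustration of the two-stream idea, the following is a minimal late-fusion sketch: raw intensity stands in for the learned spatial (appearance) stream and optical-flow magnitude for the temporal (motion) stream, and the fusion weight is an assumed hyperparameter, not a value from the paper.

```python
import numpy as np

def softmax2d(x):
    """Normalise a 2D score map into a probability-like saliency map."""
    e = np.exp(x - x.max())
    return e / e.sum()

def two_stream_saliency(frame, flow_mag, w_spatial=0.5):
    """Toy two-stream late fusion: each stream produces its own
    saliency map, then a weighted average combines them. Real
    spatio-temporal saliency networks learn both streams with CNNs."""
    spatial = softmax2d(frame)       # appearance stream (intensity stand-in)
    temporal = softmax2d(flow_mag)   # motion stream (flow-magnitude stand-in)
    return w_spatial * spatial + (1 - w_spatial) * temporal

frame = np.random.rand(32, 32)
flow = np.random.rand(32, 32)
sal = two_stream_saliency(frame, flow)
print(sal.shape)                     # (32, 32)
```

Because each stream's map is normalised, the convex combination is itself a normalised saliency map.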

Revisiting Video Saliency: A Large-Scale Benchmark and a New Model

This work introduces a new benchmark for predicting human eye movements during dynamic scene free-viewing, and proposes a novel video saliency model that augments the CNN-LSTM network architecture with an attention mechanism to enable fast, end-to-end saliency learning.

Deep Visual Attention Prediction.

  • Wenguan Wang and Jianbing Shen
  • Computer Science
    IEEE transactions on image processing : a publication of the IEEE Signal Processing Society
  • 2018
This model is based on a skip-layer network structure, which predicts human attention from multiple convolutional layers with various receptive fields and incorporates multi-level saliency predictions within a single network, significantly decreasing the redundancy of previous approaches that learn multiple network streams with different input scales.

A Spatiotemporal Saliency Model for Video Surveillance

The proposed spatiotemporal saliency model correlates more closely with the visual saliency perception of surveillance videos; it has been validated against gaze maps obtained from subjective experiments with an SMI eye tracker on surveillance video sequences.

Recurrent Mixture Density Network for Spatiotemporal Visual Attention

A spatiotemporal attentional model is presented that learns where to look in a video directly from human fixation data; it is optimized via maximum likelihood estimation using human fixations as training data, without knowledge of the action in each video.

Shallow and Deep Convolutional Networks for Saliency Prediction

This paper addresses the problem with a completely data-driven approach by training convolutional neural networks (convnets), proposing two designs: a shallow convnet trained from scratch, and a deeper one whose first three layers are adapted from another network trained for classification.

SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation

Quantitative assessments show that SegNet provides good performance with competitive inference time and the most memory-efficient inference compared to other architectures, including FCN and DeconvNet.

Static saliency vs. dynamic saliency: a comparative study

A novel camera-motion- and image-saliency-aware model for dynamic saliency prediction is proposed that outperforms the state-of-the-art methods for dynamic saliency prediction.

Rethinking Spatiotemporal Feature Learning: Speed-Accuracy Trade-offs in Video Classification

It is shown that many of the 3D convolutions can be replaced by low-cost 2D convolutions, suggesting that temporal representation learning on high-level "semantic" features is more useful.
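The speed-accuracy trade-off behind this result (the separable-convolution idea in the S3D backbone that TASED-Net builds on) comes from factorizing a k x k x k spatiotemporal convolution into a 1 x k x k spatial convolution followed by a k x 1 x 1 temporal one. A quick parameter count shows the savings; the channel sizes below are illustrative, not taken from the paper.

```python
def conv3d_params(c_in, c_out, k):
    """Weights in a full k x k x k spatiotemporal kernel (no bias)."""
    return c_in * c_out * k**3

def factorized_params(c_in, c_out, k):
    """Weights in a 1 x k x k spatial conv followed by a
    k x 1 x 1 temporal conv over the same c_out channels."""
    return c_in * c_out * k**2 + c_out * c_out * k

# Illustrative layer sizes (assumed, not from the paper).
c_in, c_out, k = 192, 256, 3
full = conv3d_params(c_in, c_out, k)
sep = factorized_params(c_in, c_out, k)
print(full, sep, round(full / sep, 2))   # 1327104 638976 2.08
```

Roughly a 2x reduction in parameters for this layer, which is where the "speed" side of the trade-off comes from.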