Predicting Deeper into the Future of Semantic Segmentation

@article{Luc2017PredictingDI,
  title={Predicting Deeper into the Future of Semantic Segmentation},
  author={Pauline Luc and Natalia Neverova and Camille Couprie and Jakob J. Verbeek and Yann LeCun},
  journal={2017 IEEE International Conference on Computer Vision (ICCV)},
  year={2017},
  pages={648-657}
}
The ability to predict and therefore to anticipate the future is an important attribute of intelligence. It is also of utmost importance in real-time systems, e.g. in robotics or autonomous driving, which depend on visual scene understanding for decision making. While prediction of the raw RGB pixel values in future video frames has been studied in previous work, here we introduce the novel task of predicting semantic segmentations of future frames. Given a sequence of video frames, our goal is… 

Figures and Tables from this paper

Segmenting the Future
TLDR
A temporal encoder-decoder network architecture that encodes RGB frames from the past and decodes the future semantic segmentation of the scene segments while simultaneously accounting for the object dynamics to infer the future scene semantic segments is proposed.
Future Semantic Segmentation with Convolutional LSTM
TLDR
A novel model is proposed that uses convolutional LSTM (ConvLSTM) to encode the spatiotemporal information of observed frames for future prediction and outperforms other state-of-the-art methods on the benchmark dataset.
Multi-Timescale Context Encoding for Scene Parsing Prediction
  • Xin Chen, Yahong Han
  • Computer Science
    2019 IEEE International Conference on Multimedia and Expo (ICME)
  • 2019
TLDR
This work proposes a novel network which catches both the short-term and long-term relations of observed frames for future scene parsing and introduces an attention mechanism to model semantic interdependencies between consecutive frames and a modified convolutional LSTM to model the correlations among all the input frames.
Predicting Future Instance Segmentations by Forecasting Convolutional Features
TLDR
A predictive model is developed in the space of fixed-sized convolutional features of the Mask R-CNN instance segmentation model that significantly improves over baselines based on optical flow.
Future Semantic Segmentation Using 3D Structure
TLDR
This work addresses the task of predicting future frame segmentation from a stream of monocular video by leveraging the 3D structure of the scene, and proposes a recurrent neural network based model capable of predicted future ego-motion trajectory as a function of a series of past ego- motion steps.
Recurrent Flow-Guided Semantic Forecasting
TLDR
This work proposes to decompose the challenging semantic forecasting task into two subtasks: current frame segmentation and future optical flow prediction, and builds an efficient, effective, low overhead model with three main components: flow prediction network, feature-flow aggregation LSTM, and end-to-end learnable warp layer.
Self-supervised learning of predictive segmentation models from video
TLDR
A predictive model in the space of high-level convolutional image features of the Mask R-CNN instance segmentation model is developed, able to produce visually pleasing segmentations at a high resolution for complex scenes involving a large number of instances, and with convincing accuracy up to half a second ahead.
Joint Future Semantic and Instance Segmentation Prediction
TLDR
A novel prediction approach that encodes instance and semantic segmentation information in a single representation based on distance maps is introduced, and graph-based modeling of the instance segmentation prediction problem allows for temporal tracks of the objects as an optimal solution to a watershed algorithm.
Predictive Feature Learning for Future Segmentation Prediction
TLDR
This work builds an autoencoder which serves as a bridge between the segmentation features and the predictor, and introduces a reconstruction constraint in the prediction module to reduce the risk of vanishing the suppressed details during recurrent feature prediction.
Future Segmentation Using 3D Structure
TLDR
This work addresses the task of predicting future frame segmentation from a stream of monocular video by leveraging the 3D structure of the scene, and proposes a recurrent neural network based model capable of predicted future ego-motion trajectory as a function of a series of past ego- motion steps.
...
...

References

SHOWING 1-10 OF 57 REFERENCES
Semantic Video Segmentation by Gated Recurrent Flow Propagation
TLDR
A deep, end-to-end trainable methodology for video segmentation that is capable of leveraging the information present in unlabeled data, besides sparsely labeled frames, in order to improve semantic estimates.
An Uncertain Future: Forecasting from Static Images Using Variational Autoencoders
TLDR
A conditional variational autoencoder is proposed for predicting the dense trajectory of pixels in a scene—what will move in the scene, where it will travel, and how it will deform over the course of one second.
Fully Convolutional Networks for Semantic Segmentation
TLDR
It is shown that convolutional networks by themselves, trained end- to-end, pixels-to-pixels, improve on the previous best result in semantic segmentation.
Anticipating the future by watching unlabeled video
TLDR
A large scale framework that capitalizes on temporal structure in unlabeled video to learn to anticipate both actions and objects in the future, and suggests that learning with unlabeling videos significantly helps forecast actions and anticipate objects.
Transformation-Based Models of Video Sequences
TLDR
This work proposes a simple unsupervised approach for next frame prediction in video that compares favourably against more sophisticated ones on the UCF-101 data set, while also being more efficient in terms of the number of parameters and computational cost.
Deep multi-scale video prediction beyond mean square error
TLDR
This work trains a convolutional network to generate future frames given an input sequence and proposes three different and complementary feature learning strategies: a multi-scale architecture, an adversarial training method, and an image gradient difference loss function.
Video Scene Parsing with Predictive Feature Learning
TLDR
It is experimentally proved that the learned predictive features in the model are able to significantly enhance the video parsing performance by combining with the standard image parsing network.
RefineNet: Multi-path Refinement Networks for High-Resolution Semantic Segmentation
TLDR
RefineNet is presented, a generic multi-path refinement network that explicitly exploits all the information available along the down-sampling process to enable high-resolution prediction using long-range residual connections and introduces chained residual pooling, which captures rich background context in an efficient manner.
Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs
TLDR
This work brings together methods from DCNNs and probabilistic graphical models for addressing the task of pixel-level classification by combining the responses at the final DCNN layer with a fully connected Conditional Random Field (CRF).
Visual Dynamics: Probabilistic Future Frame Synthesis via Cross Convolutional Networks
TLDR
A novel approach that models future frames in a probabilistic manner is proposed, namely a Cross Convolutional Network to aid in synthesizing future frames; this network structure encodes image and motion information as feature maps and convolutional kernels, respectively.
...
...