Dense Semantic Forecasting in Video by Joint Regression of Features and Feature Motion

@inproceedings{vSaric2021DenseSF,
  title={Dense Semantic Forecasting in Video by Joint Regression of Features and Feature Motion},
  author={Josip {\v{S}}ari{\'c} and Sacha Vra{\v{z}}i{\'c} and Sini{\v{s}}a {\v{S}}egvi{\'c}},
  year={2021}
}
Dense semantic forecasting anticipates future events in video by inferring pixel-level semantics of an unobserved future image. We present a novel approach that is applicable to various single-frame architectures and tasks. Our approach consists of two modules. The feature-to-motion (F2M) module forecasts a dense deformation field that warps past features into their future positions. The feature-to-feature (F2F) module regresses the future features directly and is therefore able to account for emergent…
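
As a rough illustration of the warping step that the abstract attributes to the F2M module, the sketch below applies a dense displacement field to a feature map. It is not the authors' implementation: the function name `warp_features`, the backward-warping convention, bilinear interpolation, and border clamping are all assumptions made for this example, and in the paper the field would be regressed by a network rather than supplied by hand.

```python
import numpy as np


def warp_features(features, flow):
    """Backward-warp a (C, H, W) feature map with a dense (2, H, W) field.

    Hypothetical helper for illustration only. flow[0] holds horizontal and
    flow[1] vertical displacements in pixels: output location (y, x) samples
    the past features at (y + flow[1], x + flow[0]).
    """
    _, h, w = features.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")

    # Source coordinates, clamped to the feature map (assumed border handling).
    src_x = np.clip(xs + flow[0], 0, w - 1)
    src_y = np.clip(ys + flow[1], 0, h - 1)

    # Integer corners surrounding each source coordinate.
    x0 = np.floor(src_x).astype(int)
    y0 = np.floor(src_y).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)

    # Fractional offsets act as bilinear interpolation weights.
    wx = src_x - x0
    wy = src_y - y0

    top = features[:, y0, x0] * (1 - wx) + features[:, y0, x1] * wx
    bottom = features[:, y1, x0] * (1 - wx) + features[:, y1, x1] * wx
    return top * (1 - wy) + bottom * wy


# Toy usage: a uniform displacement of -1 in x moves content one pixel right.
past = np.arange(2 * 3 * 4, dtype=float).reshape(2, 3, 4)
shift = np.zeros((2, 3, 4))
shift[0] = -1.0
future = warp_features(past, shift)
```

Under this convention a zero field returns the input unchanged, and warping can only rearrange features that were already observed, which is consistent with the abstract's motivation for pairing F2M with a direct F2F regression.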

Joint Forecasting of Panoptic Segmentations with Difference Attention

TLDR
A new panoptic segmentation forecasting model is studied that jointly forecasts all object instances in a scene using a transformer model based on ‘difference attention’, and further refines the predictions by taking depth estimates into account.

Forecasting of a representation is important for safe and effective autonomy. For this, panoptic segmentations have been studied as a compelling representation in recent work. However, recent

References

Warp to the Future: Joint Forecasting of Features and Feature Motion

TLDR
This work considers a novel F2M (feature-to-motion) formulation, which performs the forecast by warping observed features according to regressed feature flow. The method operates in synergy with correlation features and outperforms all previous approaches in both short-term and mid-term forecasting on the Cityscapes dataset.

Segmenting the Future

TLDR
This work proposes a temporal encoder-decoder network architecture that encodes past RGB frames and decodes the future semantic segmentation of the scene while simultaneously accounting for object dynamics.

Predicting Deeper into the Future of Semantic Segmentation

TLDR
An autoregressive convolutional neural network that learns to iteratively generate multiple frames is developed, and results show that directly predicting future segmentations is substantially better than predicting and then segmenting future RGB frames.

Single Level Feature-to-Feature Forecasting with Deformable Convolutions

TLDR
The method is based on a semantic segmentation model without lateral connections within the upsampling path that achieves state of the art performance on the Cityscapes validation set when forecasting nine timesteps into the future.

Predicting Future Instance Segmentations by Forecasting Convolutional Features

TLDR
A predictive model is developed in the space of fixed-sized convolutional features of the Mask R-CNN instance segmentation model that significantly improves over baselines based on optical flow.

Predicting Scene Parsing and Motion Dynamics in the Future

TLDR
A novel model to simultaneously predict scene parsing and optical flow in unobserved future video frames is proposed and shows significantly better parsing and motion prediction results when compared to well-established baselines and individual prediction models on the large-scale Cityscapes dataset.

Predictive Feature Learning for Future Segmentation Prediction

TLDR
This work builds an autoencoder which serves as a bridge between the segmentation features and the predictor, and introduces a reconstruction constraint in the prediction module to reduce the risk of losing suppressed details during recurrent feature prediction.

Recurrent Flow-Guided Semantic Forecasting

TLDR
This work proposes to decompose the challenging semantic forecasting task into two subtasks: current frame segmentation and future optical flow prediction, and builds an efficient, effective, low overhead model with three main components: flow prediction network, feature-flow aggregation LSTM, and end-to-end learnable warp layer.

Flow-Grounded Spatial-Temporal Video Prediction from Still Images

TLDR
This work formulates the multi-frame prediction task as a multiple-time-step flow (multi-flow) prediction phase followed by a flow-to-frame synthesis phase. This prevents the model from operating directly on the high-dimensional pixel space of the frame sequence and is demonstrated to produce better and more diverse results.

Future Semantic Segmentation with Convolutional LSTM

TLDR
A novel model is proposed that uses convolutional LSTM (ConvLSTM) to encode the spatiotemporal information of observed frames for future prediction and outperforms other state-of-the-art methods on the benchmark dataset.
...