Learning Semantic-Aware Dynamics for Video Prediction

  title={Learning Semantic-Aware Dynamics for Video Prediction},
  author={Xinzhu Bei and Yanchao Yang and Stefano Soatto},
  booktitle={2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
We propose an architecture and training scheme to predict video frames by explicitly modeling dis-occlusions and capturing the evolution of semantically consistent regions in the video. The scene layout (semantic map) and motion (optical flow) are decomposed into layers, which are predicted and fused with their context to generate future layouts and motions. The appearance of the scene is warped from past frames using the predicted motion in co-visible regions; dis-occluded regions are…
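The warping step described above can be illustrated with a minimal NumPy sketch: a past frame is backward-warped by a predicted flow field in co-visible regions, while a dis-occlusion mask marks pixels that cannot be copied from the past. The function name, nearest-neighbour sampling, and constant fill are illustrative assumptions, not the paper's actual implementation (which inpaints dis-occluded regions from context):

```python
import numpy as np

def warp_with_flow(prev_frame, flow, disocclusion_mask, fill_value=0.0):
    """Backward-warp prev_frame by a per-pixel flow field.

    prev_frame: (H, W) grayscale image (or any per-pixel value).
    flow: (H, W, 2) predicted motion; flow[y, x] = (dy, dx) points to the
          source location in prev_frame for target pixel (y, x).
    disocclusion_mask: (H, W) bool; True where the pixel is dis-occluded
          (not co-visible), so warping is invalid and fill_value is used.
    """
    H, W = prev_frame.shape
    ys, xs = np.mgrid[0:H, 0:W]
    # Source coordinates, rounded to nearest neighbour and clipped to bounds.
    src_y = np.clip(np.round(ys + flow[..., 0]).astype(int), 0, H - 1)
    src_x = np.clip(np.round(xs + flow[..., 1]).astype(int), 0, W - 1)
    warped = prev_frame[src_y, src_x]
    # Dis-occluded pixels cannot be copied from the past; fill them instead.
    return np.where(disocclusion_mask, fill_value, warped)

# Toy example: shift a 4x4 ramp image one pixel to the right; the newly
# exposed left column is dis-occluded.
frame = np.arange(16, dtype=float).reshape(4, 4)
flow = np.zeros((4, 4, 2))
flow[..., 1] = -1.0  # each target pixel samples from x - 1
mask = np.zeros((4, 4), dtype=bool)
mask[:, 0] = True    # left column has no source in the past frame
out = warp_with_flow(frame, flow, mask)
```

In a real model, the flow, semantic layers, and dis-occlusion mask would themselves be network predictions rather than hand-built arrays.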
Semantic Prediction: Which One Should Come First, Recognition or Prediction?
This work investigates configurations using the Local Frequency Domain Transformer Network (LFDTN) as the video prediction model and U-Net as the semantic extraction model on synthetic and real datasets.
PredRNN: A Recurrent Neural Network for Spatiotemporal Predictive Learning
This work presents PredRNN, a new recurrent network in which a pair of memory cells is explicitly decoupled; the cells operate in nearly independent transition modes and finally form unified representations of the complex environment.


Future Video Synthesis With Object Motion Prediction
An approach to predicting future video frames from a sequence of continuous past frames by decoupling the background scene from moving objects; the model is shown to outperform the state of the art in both visual quality and accuracy.
Decomposing Motion and Content for Natural Video Sequence Prediction
To the best of the authors' knowledge, this is the first end-to-end trainable network architecture with motion and content separation to model the spatiotemporal dynamics for pixel-level future prediction in natural videos.
Generating Videos with Scene Dynamics
A generative adversarial network for video with a spatio-temporal convolutional architecture that untangles the scene's foreground from its background is proposed; it can generate tiny videos up to a second long at full frame rate better than simple baselines.
Flow-Grounded Spatial-Temporal Video Prediction from Still Images
This work formulates the multi-frame prediction task as a multiple-time-step flow (multi-flow) prediction phase followed by a flow-to-frame synthesis phase; this prevents the model from operating directly in the high-dimensional pixel space of the frame sequence and is demonstrated to be more effective at predicting better and more diverse results.
Structure Preserving Video Prediction
An RNN structure for video prediction is proposed, which employs temporal-adaptive convolutional kernels to capture time-varying motion patterns as well as tiny objects within a scene.
Future Semantic Segmentation with Convolutional LSTM
A novel model is proposed that uses convolutional LSTM (ConvLSTM) to encode the spatiotemporal information of observed frames for future prediction and outperforms other state-of-the-art methods on the benchmark dataset.
Learning to Forecast and Refine Residual Motion for Image-to-Video Generation
This work trains networks to learn residual motion between the current and future frames, which avoids learning motion-irrelevant details, and proposes a two-stage generation framework where videos are generated from structures and then refined by temporal signals.
Probabilistic Video Generation using Holistic Attribute Control
This work improves video generation consistency through temporally conditional sampling, and quality by structuring the latent space with attribute controls, ensuring that attributes can be both inferred and conditioned on during learning and generation.
Disentangling Propagation and Generation for Video Prediction
A computational model for high-fidelity video prediction which disentangles motion-specific propagation from motion-agnostic generation and introduces a confidence-aware warping operator that gates the output of pixel predictions from a flow predictor for non-occluded regions and from a context encoder for occluded regions.
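The gating idea behind a confidence-aware warping operator can be sketched as a per-pixel convex combination of the flow-warped prediction and the context-encoder output. The function name and the exact blending form are assumptions for illustration, not the paper's precise operator:

```python
import numpy as np

def confidence_gated_fuse(warped, generated, confidence):
    """Blend flow-propagated pixels with generated pixels.

    warped: (H, W) prediction warped from a past frame by a flow predictor.
    generated: (H, W) prediction from a context encoder (for occluded areas).
    confidence: (H, W) in [0, 1]; high where warping is reliable
                (non-occluded regions), low in occluded regions.
    """
    confidence = np.clip(confidence, 0.0, 1.0)
    # Convex combination: fully trust warping at conf=1, generation at conf=0.
    return confidence * warped + (1.0 - confidence) * generated

# Toy example: constant warped/generated images, mixed confidences.
warped = np.full((2, 2), 10.0)
generated = np.full((2, 2), 2.0)
conf = np.array([[1.0, 0.0],
                 [0.5, 1.0]])
fused = confidence_gated_fuse(warped, generated, conf)
```

In practice, the confidence map would itself be predicted by the network (e.g. from forward-backward flow consistency) rather than supplied by hand.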
Video Frame Synthesis Using Deep Voxel Flow
This work addresses the problem of synthesizing new video frames in an existing video, either in-between existing frames (interpolation) or subsequent to them (extrapolation), by training a deep network that learns to synthesize video frames by flowing pixel values from existing ones, which is called deep voxel flow.
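The core sampling idea — synthesizing a new frame by flowing pixel values from two existing frames and blending them with a temporal weight — can be sketched as below. This is a simplified stand-in: nearest-neighbour sampling and a symmetric-motion assumption replace the paper's differentiable trilinear interpolation, and all names are hypothetical:

```python
import numpy as np

def voxel_flow_synthesize(frame0, frame1, flow, blend):
    """Synthesize an intermediate frame by sampling both neighbour frames.

    frame0, frame1: (H, W) existing frames.
    flow: (H, W, 2); each target pixel samples frame0 at +flow and
          frame1 at -flow (symmetric motion assumption).
    blend: (H, W) temporal weight in [0, 1] between the two samples.
    """
    H, W = frame0.shape
    ys, xs = np.mgrid[0:H, 0:W]

    def sample(img, dy, dx):
        # Nearest-neighbour lookup, clipped to the image bounds.
        sy = np.clip(np.round(ys + dy).astype(int), 0, H - 1)
        sx = np.clip(np.round(xs + dx).astype(int), 0, W - 1)
        return img[sy, sx]

    s0 = sample(frame0, flow[..., 0], flow[..., 1])
    s1 = sample(frame1, -flow[..., 0], -flow[..., 1])
    return blend * s0 + (1.0 - blend) * s1

# Toy example: interpolating halfway between a dark and a bright frame
# with zero motion should give the average intensity.
f0 = np.zeros((3, 3))
f1 = np.full((3, 3), 4.0)
flow = np.zeros((3, 3, 2))
blend = np.full((3, 3), 0.5)
mid = voxel_flow_synthesize(f0, f1, flow, blend)
```

In the actual method, both the flow and the blend weight form a single "voxel flow" field predicted end-to-end by the network.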