Corpus ID: 235359163

Efficient training for future video generation based on hierarchical disentangled representation of latent variables

@article{Fushishita2021EfficientTF,
  title={Efficient training for future video generation based on hierarchical disentangled representation of latent variables},
  author={Naoya Fushishita and Antonio Tejero-de-Pablos and Yusuke Mukuta and Tatsuya Harada},
  journal={ArXiv},
  year={2021},
  volume={abs/2106.03502}
}
Generating videos that predict the future of a given sequence has been an area of active research in recent years. However, an essential problem remains unsolved: most methods require a large computational cost and high memory usage for training. In this paper, we propose a novel method for generating future-prediction videos with less memory usage than conventional methods. This is a critical stepping stone on the path towards generating videos with high image quality, similar to that of… 


References

Showing 1–10 of 30 references

Decomposing Motion and Content for Natural Video Sequence Prediction

To the best of the authors' knowledge, this is the first end-to-end trainable network architecture with motion and content separation to model the spatiotemporal dynamics for pixel-level future prediction in natural videos.

Learning to Decompose and Disentangle Representations for Video Prediction

The Decompositional Disentangled Predictive Auto-Encoder (DDPAE) is proposed: a framework that combines structured probabilistic models and deep networks to automatically decompose the high-dimensional video to be predicted into components, and to disentangle each component into low-dimensional temporal dynamics that are easier to predict.
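
As a rough illustration of the decomposition idea only (not DDPAE's actual architecture, which also involves structured probabilistic inference), the sketch below predicts each of K low-dimensional component codes with its own small recurrent predictor; the class name ComponentDynamics, the dimensions, and the use of GRU cells are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class ComponentDynamics(nn.Module):
    """Toy sketch: each of K low-dimensional component codes gets its
    own recurrent predictor, so dynamics are modelled per component."""

    def __init__(self, num_components=2, dim_component=8):
        super().__init__()
        self.num_components = num_components
        self.predictors = nn.ModuleList(
            [nn.GRUCell(dim_component, dim_component) for _ in range(num_components)]
        )

    def forward(self, codes, steps):
        """codes: (batch, num_components, dim_component) from the last
        observed frame; returns codes for `steps` future frames."""
        hidden = [torch.zeros_like(codes[:, k]) for k in range(self.num_components)]
        current = [codes[:, k] for k in range(self.num_components)]
        outputs = []
        for _ in range(steps):
            hidden = [self.predictors[k](current[k], hidden[k])
                      for k in range(self.num_components)]
            current = hidden  # predicted code of each component for this step
            outputs.append(torch.stack(current, dim=1))
        # (batch, steps, num_components, dim_component)
        return torch.stack(outputs, dim=1)

model = ComponentDynamics()
future = model(torch.randn(4, 2, 8), steps=10)
print(future.shape)  # torch.Size([4, 10, 2, 8])
```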

Efficient Video Generation on Complex Datasets

This work shows that large Generative Adversarial Networks trained on the complex Kinetics-600 dataset are able to produce video samples of substantially higher complexity than previous work.

Deep Video Generation, Prediction and Completion of Human Action Sequences

This paper proposes a general, two-stage deep framework to generate human action videos under no constraints or an arbitrary number of constraints, which uniformly addresses three problems: video generation given no input frames, video prediction given the first few frames, and video completion given the first and last frames.

Deep multi-scale video prediction beyond mean square error

This work trains a convolutional network to generate future frames given an input sequence and proposes three different and complementary feature learning strategies: a multi-scale architecture, an adversarial training method, and an image gradient difference loss function.
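
The image gradient difference loss compares the spatial gradients of the predicted and ground-truth frames rather than their raw intensities. The snippet below is a minimal sketch of such a loss, assuming PyTorch; the function name gradient_difference_loss and the default alpha=1.0 are illustrative choices, not the paper's code.

```python
import torch

def gradient_difference_loss(pred, target, alpha=1.0):
    """Penalize differences between the spatial gradients of the
    predicted and ground-truth frames (sharper results than plain MSE).

    pred, target: tensors of shape (batch, channels, height, width).
    """
    # Horizontal and vertical intensity gradients of both frames.
    pred_dx = (pred[..., :, 1:] - pred[..., :, :-1]).abs()
    pred_dy = (pred[..., 1:, :] - pred[..., :-1, :]).abs()
    target_dx = (target[..., :, 1:] - target[..., :, :-1]).abs()
    target_dy = (target[..., 1:, :] - target[..., :-1, :]).abs()

    # Difference of gradient magnitudes, raised to the power alpha.
    loss_x = (pred_dx - target_dx).abs().pow(alpha).mean()
    loss_y = (pred_dy - target_dy).abs().pow(alpha).mean()
    return loss_x + loss_y

# Example usage on random frames.
pred = torch.rand(2, 3, 64, 64)
target = torch.rand(2, 3, 64, 64)
print(gradient_difference_loss(pred, target).item())
```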

Generating Videos with Scene Dynamics

A generative adversarial network for video with a spatio-temporal convolutional architecture that untangles the scene's foreground from the background is proposed, which can generate tiny videos up to a second long at full frame rate better than simple baselines.
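
The foreground/background untangling amounts to compositing a moving foreground stream with a static background image through a soft mask. The snippet below sketches only that compositing step under assumed tensor shapes; the function name compose_video and the example sizes are illustrative.

```python
import torch

def compose_video(foreground, mask, background):
    """Combine a moving foreground stream with a static background.

    foreground: (batch, channels, time, height, width)
    mask:       (batch, 1, time, height, width), values in [0, 1]
    background: (batch, channels, height, width), one static image
    """
    background = background.unsqueeze(2)          # broadcast over time
    return mask * foreground + (1.0 - mask) * background

fg = torch.rand(1, 3, 16, 64, 64)
m = torch.rand(1, 1, 16, 64, 64)
bg = torch.rand(1, 3, 64, 64)
video = compose_video(fg, m, bg)
print(video.shape)  # torch.Size([1, 3, 16, 64, 64])
```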

The Pose Knows: Video Forecasting by Generating Pose Futures

This work exploits human pose detectors as a free source of supervision and breaks the video forecasting problem into two discrete steps, and uses the structured space of pose as an intermediate representation to sidestep the problems that GANs have in generating video pixels directly.

Transformation-Based Models of Video Sequences

This work proposes a simple unsupervised approach for next frame prediction in video that compares favourably against more sophisticated ones on the UCF-101 data set, while also being more efficient in terms of the number of parameters and computational cost.
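
The core idea is to predict a transformation of the previous frame rather than raw pixels. The sketch below warps the last frame with a single global affine transform as a simplification (the paper works with patch-wise affine transforms); the helper name warp_with_affine and the hand-picked theta stand in for parameters a learned predictor would output.

```python
import torch
import torch.nn.functional as F

def warp_with_affine(prev_frame, theta):
    """Produce the next frame by warping the previous one with a
    predicted 2x3 affine transform instead of generating raw pixels.

    prev_frame: (batch, channels, height, width)
    theta:      (batch, 2, 3) affine parameters from some predictor
    """
    grid = F.affine_grid(theta, prev_frame.shape, align_corners=False)
    return F.grid_sample(prev_frame, grid, align_corners=False)

prev = torch.rand(1, 3, 64, 64)
# Identity transform plus a small horizontal shift, as a stand-in
# for parameters a learned network would predict.
theta = torch.tensor([[[1.0, 0.0, 0.1],
                       [0.0, 1.0, 0.0]]])
next_frame = warp_with_affine(prev, theta)
print(next_frame.shape)  # torch.Size([1, 3, 64, 64])
```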

MoCoGAN: Decomposing Motion and Content for Video Generation

This work introduces a novel adversarial learning scheme utilizing both image and video discriminators, and shows that MoCoGAN allows one to generate videos with the same content but different motion, as well as videos with different content and the same motion.
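
MoCoGAN's decomposition can be pictured as sampling one content code per clip and a recurrent sequence of motion codes, one per frame. The sketch below shows only that latent-sampling step under assumed dimensions (dim_content=50, dim_motion=10); the class name LatentSampler and the GRU-based motion path are illustrative rather than the authors' released code.

```python
import torch
import torch.nn as nn

class LatentSampler(nn.Module):
    """Sample one content code per clip and a motion code per frame."""

    def __init__(self, dim_content=50, dim_motion=10):
        super().__init__()
        self.dim_content = dim_content
        self.dim_motion = dim_motion
        # Recurrent network turning per-frame noise into motion codes.
        self.motion_rnn = nn.GRUCell(dim_motion, dim_motion)

    def forward(self, batch_size, num_frames):
        # One content vector per video, repeated for every frame.
        z_content = torch.randn(batch_size, self.dim_content)
        h = torch.zeros(batch_size, self.dim_motion)
        frames = []
        for _ in range(num_frames):
            eps = torch.randn(batch_size, self.dim_motion)
            h = self.motion_rnn(eps, h)   # motion code for this frame
            frames.append(torch.cat([z_content, h], dim=1))
        # (batch, num_frames, dim_content + dim_motion)
        return torch.stack(frames, dim=1)

z = LatentSampler()(batch_size=4, num_frames=16)
print(z.shape)  # torch.Size([4, 16, 60])
```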

Hierarchical Video Generation from Orthogonal Information: Optical Flow and Texture

This study focuses on motion and appearance information as two important orthogonal components of a video, and proposes Flow-and-Texture Generative Adversarial Networks (FTGAN), consisting of FlowGAN and TextureGAN, which generates more plausible motion videos and achieves significantly improved performance on unsupervised action classification in comparison to previous GAN works.