Cubic LSTMs for Video Prediction

@inproceedings{Fan2019CubicLF,
  title={Cubic LSTMs for Video Prediction},
  author={Hehe Fan and Linchao Zhu and Yi Yang},
  booktitle={AAAI},
  year={2019}
}
Predicting future frames in videos has become a promising direction of research for both computer vision and robot learning communities. The core of this problem involves moving object capture and future motion prediction. While object capture specifies which objects are moving in videos, motion prediction describes their future dynamics. Motivated by this analysis, we propose a Cubic Long Short-Term Memory (CubicLSTM) unit for video prediction. CubicLSTM consists of three branches, i.e., a…
Citations

From Video Classification to Video Prediction: Deep Learning Approaches to Video Modelling
A selective multi-instance learning method is proposed to automatically detect whether an event of interest happens in temporally untrimmed videos, which usually consist of multiple video shots, and a series of point spatio-temporal correlation recurrent neural networks is proposed to jointly aggregate point coordinates and features in the input, states, and output.
A Quadruple Diffusion Convolutional Recurrent Network for Human Motion Prediction
This work presents a novel diffusion convolutional recurrent predictor for spatial and temporal movement forecasting, with multi-step random walks traversing bidirectionally along an adaptive graph to model interdependency among body joints.
SLAMP: Stochastic Latent Appearance and Motion Prediction
This work reasons about appearance and motion in the video stochastically by predicting the future based on the motion history, and significantly outperforms state-of-the-art models on two challenging real-world autonomous driving datasets with complex motion and dynamic backgrounds.
Self-Supervision by Prediction for Object Discovery in Videos
This paper uses the prediction task as self-supervision and builds a novel object-centric model for image sequence representation that can be trained without the help of any manual annotation or pretrained network.
Recurrent Semantic Preserving Generation for Action Prediction
This paper develops a generation architecture that completes the sequence of skeletons for predicting the action, which can exploit more potential information about the movement tendency, and controls the generation step in a recurrent manner to maximize the discriminative information of actions.
Aggregating Attentional Dilated Features for Salient Object Detection
This paper presents a novel deep learning model to aggregate the attentional dilated features for salient object detection by exploring the complementary information between the global and local…
A review on the long short-term memory model
A comprehensive review of LSTM’s formulation and training, relevant applications reported in the literature, and code resources implementing this model for a toy example are presented.
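As a rough companion to the formulation that review covers, a single LSTM step can be sketched in NumPy. The gate ordering and the fused weight layout below are my own assumptions for illustration, not taken from the review:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM step.

    x: input vector; h_prev, c_prev: previous hidden and cell states.
    W: fused weights of shape (len(x) + hidden, 4 * hidden);
    b: bias of shape (4 * hidden,). Assumed gate order: i, f, g, o.
    """
    n = h_prev.shape[0]
    z = np.concatenate([x, h_prev]) @ W + b
    i = sigmoid(z[:n])               # input gate
    f = sigmoid(z[n:2 * n])          # forget gate
    g = np.tanh(z[2 * n:3 * n])      # candidate cell update
    o = sigmoid(z[3 * n:])           # output gate
    c = f * c_prev + i * g           # new cell state
    h = o * np.tanh(c)               # new hidden state
    return h, c
```

With all-zero weights and states, the step returns zero vectors; in practice W and b are learned by backpropagation through time, and convolutional variants of these gates (as in ConvLSTM-style video models) replace the matrix product with convolutions.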
Unsupervised Sequence Forecasting of 100,000 Points for Unsupervised Trajectory Forecasting
This work studies the problem of future prediction at the level of 3D scenes, represented by point clouds captured by a LiDAR sensor, and presents a new object trajectory forecasting pipeline leveraging SPCSFNet that is able to outperform the conventional pipeline using state-of-the-art trajectory forecasting methods.
Activity Image-to-Video Retrieval by Disentangling Appearance and Motion
This paper proposes a Motion-assisted Activity Proposal-based Image-to-Video Retrieval (MAP-IVR) approach to disentangle the video features into motion features and appearance features and obtain appearance features from the images.

References

Showing 1-10 of 40 references
Decomposing Motion and Content for Natural Video Sequence Prediction
To the best of the authors' knowledge, this is the first end-to-end trainable network architecture with motion and content separation to model the spatiotemporal dynamics for pixel-level future prediction in natural videos.
Generating Videos with Scene Dynamics
A generative adversarial network for video with a spatio-temporal convolutional architecture that untangles the scene's foreground from the background is proposed; it can generate tiny videos up to a second long at full frame rate, better than simple baselines.
PredRNN: Recurrent Neural Networks for Predictive Learning using Spatiotemporal LSTMs
A predictive recurrent neural network (PredRNN) that achieves state-of-the-art prediction performance on three video prediction datasets and serves as a more general framework that can easily be extended to other predictive learning tasks by integrating with other architectures.
Anticipating Visual Representations from Unlabeled Video
This work presents a framework that capitalizes on temporal structure in unlabeled video to learn to anticipate human actions and objects, applying recognition algorithms to the predicted representations to anticipate objects and actions.
Bidirectional Multirate Reconstruction for Temporal Modeling in Videos
Linchao Zhu, Zhongwen Xu, Yi Yang. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
An unsupervised temporal modeling method that learns from untrimmed videos; it generates the best single feature for event detection, with a relative improvement of 10.4% on the MEDTest-13 dataset, and achieves the best performance in video captioning across all evaluation metrics on the YouTube2Text dataset.
Deep multi-scale video prediction beyond mean square error
This work trains a convolutional network to generate future frames given an input sequence and proposes three different and complementary feature learning strategies: a multi-scale architecture, an adversarial training method, and an image gradient difference loss function.
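The image gradient difference loss mentioned above penalizes mismatches between the spatial gradients of predicted and ground-truth frames, which encourages sharper edges than a plain MSE objective. A minimal single-frame NumPy sketch follows; the function name and the exact reduction are my own choices, and the paper itself applies the term alongside MSE and adversarial losses inside a multi-scale network:

```python
import numpy as np

def gradient_difference_loss(pred, target, alpha=1.0):
    """Penalize differences between the absolute spatial gradients
    of a predicted frame and the ground-truth frame (2D arrays)."""
    # Absolute vertical (row-wise) and horizontal (column-wise) gradients.
    pv, ph = np.abs(np.diff(pred, axis=0)), np.abs(np.diff(pred, axis=1))
    tv, th = np.abs(np.diff(target, axis=0)), np.abs(np.diff(target, axis=1))
    # Sum of gradient mismatches, raised to the exponent alpha.
    return np.sum(np.abs(pv - tv) ** alpha) + np.sum(np.abs(ph - th) ** alpha)
```

Note that a prediction differing from the target by a constant brightness offset incurs zero loss here, since only gradients are compared; that is why the loss is used as a sharpening term in combination with an intensity loss, not on its own.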
Action-Conditional Video Prediction using Deep Networks in Atari Games
This paper is the first to make and evaluate long-term predictions on high-dimensional video conditioned on control inputs, and proposes and evaluates two deep neural network architectures consisting of encoding, action-conditional transformation, and decoding layers based on convolutional and recurrent neural networks.
Video (language) modeling: a baseline for generative models of natural videos
For the first time, it is shown that a strong baseline model for unsupervised feature learning using video data can predict non-trivial motions over short video sequences.
Deep Predictive Coding Networks for Video Prediction and Unsupervised Learning
The results suggest that prediction represents a powerful framework for unsupervised learning, allowing for implicit learning of object and scene structure.
Unsupervised Learning of Video Representations using LSTMs
This work uses Long Short-Term Memory networks to learn representations of video sequences and evaluates the representations by fine-tuning them for a supervised learning problem: human action recognition on the UCF-101 and HMDB-51 datasets.