• Corpus ID: 67856106

Frequency Domain Transformer Networks for Video Prediction

  • Hafez Farazi, Sven Behnke
The task of video prediction is to forecast future frames given several previous frames. Despite much recent progress, this task remains challenging, mainly due to the high nonlinearity of video in the spatial domain. To address this issue, we propose a novel architecture, the Frequency Domain Transformer Network (FDTN), an end-to-end learnable model that estimates and applies transformations of the signal in the frequency domain. Experimental evaluations show that this approach can outperform some… 
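As a rough illustration of the frequency-domain idea, the Fourier shift theorem lets a pure translation between two frames appear as a per-frequency phase difference, which can then be extrapolated by applying the same phase shift once more. The sketch below is a minimal hand-crafted analogue, not the paper's learned model; the function name `predict_next_frame` and all details are illustrative assumptions:

```python
import numpy as np

def predict_next_frame(f0, f1):
    """Predict a third frame from two frames related by translation:
    estimate the per-frequency phase shift from f0 to f1 and apply it
    once more to f1 (illustrative sketch, not the actual FDTN)."""
    F0 = np.fft.fft2(f0)
    F1 = np.fft.fft2(f1)
    # Per-frequency phase difference; normalizing away the magnitude
    # keeps only the transformation (the translation's phase ramp).
    D = F1 * np.conj(F0)
    D /= np.abs(D) + 1e-12
    # Applying the same phase shift again extrapolates the motion.
    F2 = F1 * D
    return np.fft.ifft2(F2).real
```

For frames related by an exact circular translation, this reproduces the next frame up to floating-point error; real video needs the learned, local treatment the papers above describe.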
Motion Segmentation using Frequency Domain Transformer Networks
This work proposes a novel end-to-end learnable architecture that predicts the next frame by modeling foreground and background separately while simultaneously estimating and predicting the foreground motion using Frequency Domain Transformer Networks.
Local Frequency Domain Transformer Networks for Video Prediction
It is demonstrated that the method is readily extended to perform motion segmentation and account for the scene’s composition, and learns to produce reliable predictions in an entirely interpretable manner by only observing unlabeled video data.
Fourier-based Video Prediction through Relational Object Motion
This work explores a different approach to video prediction, using frequency-domain methods and explicitly inferring object-motion relationships in the observed scene; the resulting predictions are consistent with the observed dynamics and do not suffer from blur.
Video Prediction using Local Phase Differences
  • 2020
Video prediction is commonly referred to as the task of forecasting future frames of a video sequence given several past frames thereof. It remains a challenging domain as visual scenes evolve…
Semantic Prediction: Which One Should Come First, Recognition or Prediction?
This work investigates configurations using the Local Frequency Domain Transformer Network (LFDTN) as the video prediction model and U-Net as the semantic extraction model on synthetic and real datasets.
Taylor Swift: Taylor Driven Temporal Modeling for Swift Future Frame Prediction
TayloSwiftNet is introduced: a novel convolutional neural network that learns to estimate the higher-order terms of the Taylor series for a given input video, can swiftly predict any desired future frame in just one forward pass, and can change the temporal resolution on the fly.
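To make the Taylor-series idea concrete, the sketch below extrapolates future frames using hand-crafted backward differences in place of the learned higher-order terms; the function `taylor_predict` and its details are illustrative assumptions, not the paper's network:

```python
import numpy as np

def taylor_predict(frames, steps):
    """Extrapolate `steps` frames ahead via Newton's backward-difference
    form of a Taylor expansion (a hand-crafted stand-in for learned
    higher-order temporal terms)."""
    frames = [np.asarray(f, dtype=float) for f in frames]
    # Backward differences at the newest frame stand in for the temporal
    # derivatives of increasing order.
    diffs = [frames[-1]]
    cur = frames
    while len(cur) > 1:
        cur = [b - a for a, b in zip(cur, cur[1:])]
        diffs.append(cur[-1])
    # f(T + s) = sum_k C(s + k - 1, k) * diff_k; this is exact when each
    # pixel evolves as a polynomial of degree < len(frames) in time.
    pred = np.zeros_like(frames[-1])
    coeff = 1.0
    for k, d in enumerate(diffs):
        pred = pred + coeff * d
        coeff = coeff * (steps + k) / (k + 1)
    return pred
```

With `steps` as a free parameter, any future offset can be predicted from the same differences, mirroring the on-the-fly temporal resolution the summary mentions.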
Utilizing Temporal Information in Deep Convolutional Network for Efficient Soccer Ball Detection and Tracking
This work presents a novel convolutional neural network approach to detect the soccer ball in an image sequence that exploits spatio-temporal correlation and detects the ball based on the trajectory of its movements.
PISEP^2: Pseudo Image Sequence Evolution based 3D Pose Prediction
A skeletal representation is proposed that transforms the joint coordinate sequence into an image sequence, which can model the distinct correlations of different joints, together with a novel inference network that predicts all future poses in one step by decoupling the decoders in a non-recursive manner.
Object-centered Fourier Motion Estimation and Segment-Transformation Prediction
An object-centered movement estimation, frame prediction, and correction framework using frequency-domain approaches that transforms single objects based on estimated translation and rotation speeds, which are corrected using a learned encoding of the past.
Location Dependency in Video Prediction
The results indicate that encoding location-dependent features is crucial for the task of video prediction, and the proposed methods significantly outperform spatially invariant models.
Video Ladder Networks
The basic version of VLN is extended to incorporate ResNet-style residual blocks in the encoder and decoder, which help improve the prediction results.
Video Pixel Networks
A probabilistic video model, the Video Pixel Network (VPN), that estimates the discrete joint distribution of the raw pixel values in a video and generalizes to the motion of novel objects.
Modeling spatiotemporal information with convolutional gated networks
The developed convolutional version of the bilinear model for predicting spatiotemporal data halved the 4-step prediction loss while reducing the number of parameters by a factor of 159 compared to the original model.
Modeling Deep Temporal Dependencies with Recurrent "Grammar Cells"
This work shows how a bi-linear model of transformations, such as a gated autoencoder, can be turned into a recurrent network by training it to predict future frames from the current one and the inferred transformation, using backprop-through-time.
Extension of phase correlation to subpixel registration
It is shown that for downsampled images the signal power in the phase correlation is not concentrated in a single peak, but rather in several coherent peaks mostly adjacent to each other.
An FFT-based technique for translation, rotation, and scale-invariant image registration
This correspondence discusses an extension of the well-known phase correlation technique to cover translation, rotation, and scaling, which shows excellent robustness against random noise.
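The underlying phase correlation technique can be sketched in a few lines: normalize the cross-power spectrum of two images so that only phase remains, then read the translation off the peak of its inverse transform. The function name and details below are illustrative, covering only the translation case (not the rotation and scaling extension):

```python
import numpy as np

def phase_correlation(a, b):
    """Estimate the integer translation (dy, dx) such that
    b == np.roll(a, (dy, dx)), via FFT-based phase correlation."""
    Fa = np.fft.fft2(a)
    Fb = np.fft.fft2(b)
    # Cross-power spectrum, normalized so only the phase (which encodes
    # the shift, by the Fourier shift theorem) survives.
    R = np.conj(Fa) * Fb
    R /= np.abs(R) + 1e-12
    # Its inverse transform is (ideally) a delta at the shift.
    corr = np.fft.ifft2(R).real
    dy, dx = np.unravel_index(np.argmax(corr), corr.shape)
    # Wrap shifts beyond half the image size to negative values.
    h, w = a.shape
    if dy > h // 2:
        dy -= h
    if dx > w // 2:
        dx -= w
    return int(dy), int(dx)
```

The subpixel extension cited above works by interpolating among the several coherent peaks of `corr` rather than taking a single argmax.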
Learning to relate images.
  • R. Memisevic
  • Computer Science, Medicine
    IEEE transactions on pattern analysis and machine intelligence
  • 2013
This paper reviews recent work on relational feature learning, provides an analysis of the role that multiplicative interactions play in learning to encode relations, and discusses how square-pooling and complex-cell models can be viewed as a way to represent multiplicative interactions and thereby to encode relations.