Attentive Semantic Video Generation Using Captions

@article{Marwah2017AttentiveSV,
  title={Attentive Semantic Video Generation Using Captions},
  author={Tanya Marwah and Gaurav Mittal and Vineeth N. Balasubramanian},
  journal={2017 IEEE International Conference on Computer Vision (ICCV)},
  year={2017},
  pages={1435-1443}
}
This paper proposes a network architecture to perform variable length semantic video generation using captions. We adopt a new perspective towards video generation where we allow the captions to be combined with the long-term and short-term dependencies between video frames and thus generate a video in an incremental manner. Our experiments demonstrate our network architecture’s ability to distinguish between objects, actions and interactions in a video and combine them to generate videos for… 
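To make the described pipeline concrete, below is a minimal, illustrative sketch (not the authors' implementation) of caption-conditioned, incremental frame generation: a caption embedding is fused with per-frame noise and a recurrent state that carries short- and long-term temporal context, and frames are emitted one at a time so clips of variable length can be produced. All module names, dimensions, and the fusion scheme are assumptions; the paper's attention mechanism and variational components are omitted.

# Minimal, illustrative sketch (not the authors' implementation) of
# caption-conditioned, incremental video generation. All names, sizes,
# and the fusion scheme below are assumptions for illustration only.
import torch
import torch.nn as nn

class CaptionConditionedVideoGenerator(nn.Module):
    def __init__(self, caption_dim=256, latent_dim=128, hidden_dim=512, frame_size=64):
        super().__init__()
        self.latent_dim = latent_dim
        self.frame_size = frame_size
        # Recurrent core: carries short/long-term dependencies across generated frames.
        self.rnn = nn.LSTMCell(caption_dim + latent_dim, hidden_dim)
        # Frame decoder: maps the fused state to one (grayscale) frame.
        self.decode = nn.Sequential(
            nn.Linear(hidden_dim, 1024),
            nn.ReLU(),
            nn.Linear(1024, frame_size * frame_size),
            nn.Sigmoid(),
        )

    def forward(self, caption_emb, num_frames):
        # caption_emb: (batch, caption_dim) sentence embedding, e.g. from skip-thoughts.
        batch = caption_emb.size(0)
        h = caption_emb.new_zeros(batch, self.rnn.hidden_size)
        c = caption_emb.new_zeros(batch, self.rnn.hidden_size)
        frames = []
        for _ in range(num_frames):  # incremental, frame-by-frame generation
            z = torch.randn(batch, self.latent_dim, device=caption_emb.device)
            h, c = self.rnn(torch.cat([caption_emb, z], dim=1), (h, c))
            frame = self.decode(h).view(batch, 1, self.frame_size, self.frame_size)
            frames.append(frame)
        return torch.stack(frames, dim=1)  # (batch, num_frames, 1, H, W)

# Usage: a 16-frame clip from a dummy caption embedding.
gen = CaptionConditionedVideoGenerator()
video = gen(torch.randn(4, 256), num_frames=16)
print(video.shape)  # torch.Size([4, 16, 1, 64, 64])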

Citations

From Recognition to Generation Using Deep Learning: A Case Study with Video Generation
TLDR
This paper proposes two network architectures to perform video generation from captions using Variational Autoencoders and shows that the network’s ability to learn a latent representation allows it to generate videos in an unsupervised manner and perform other tasks such as action recognition.
Conditional Video Generation Using Action-Appearance Captions
TLDR
Conditional Flow and Texture GAN (CFT-GAN), a GAN-based method for video generation from action-appearance captions, is presented and shown to successfully generate videos containing the actions and appearances indicated in the captions.
Video Generation from Text Employing Latent Path Construction for Temporal Modeling
TLDR
This paper presents the first work on generating video from text (free-form sentences) on more realistic video datasets such as the Actor and Action Dataset (A2D) and UCF101, and provides quantitative and qualitative results that demonstrate the superiority of the method over well-known baselines.
Imagine This! Scripts to Compositions to Videos
TLDR
This work presents the Composition, Retrieval, and Fusion Network (CRAFT), a model capable of learning knowledge from video-caption data and applying it while generating videos from novel captions, and evaluates CRAFT on semantic fidelity to caption, composition consistency, and visual quality.
Compositional Video Synthesis with Action Graphs
TLDR
This work introduces a generative model (AG2Vid) based on Action Graphs, a natural and convenient structure that represents the dynamics of actions between objects over time, allowing for more accurate generation of videos.
Multi-person/Group Interactive Video Generation
TLDR
This work proposes a novel human motion generation framework that simultaneously considers the temporal coherence of each individual action; it consists of two components: a Semantic Extractor and a Motion Generator.
Deep Video Generation, Prediction and Completion of Human Action Sequences
TLDR
This paper proposes a general, two-stage deep framework to generate human action videos with no constraints or an arbitrary number of constraints, which uniformly addresses three problems: video generation given no input frames, video prediction given the first few frames, and video completion given the first and last frames.
Learning Semantic-Aware Dynamics for Video Prediction
TLDR
An architecture and training scheme are proposed to predict video frames by explicitly modeling dis-occlusions and capturing the evolution of semantically consistent regions in the video, and are evaluated on video prediction benchmarks.
Controllable Video Generation with Sparse Trajectories
TLDR
This work presents a conditional video generation model that allows detailed control over the motion of the generated video, and proposes a training paradigm that calculates trajectories from video clips, eliminating the need for annotated training data.
VideoGPT: Video Generation using VQ-VAE and Transformers
TLDR
Despite the simplicity in formulation and ease of training, the proposed architecture is able to generate samples competitive with state-of-the-art GAN models for video generation on the BAIR Robot dataset, and to generate high-fidelity natural videos from UCF-101 and the Tumblr GIF Dataset (TGIF).

References

Generating Videos with Scene Dynamics
TLDR
A generative adversarial network for video with a spatio-temporal convolutional architecture that untangles the scene's foreground from its background is proposed; it can generate tiny videos up to a second long at full frame rate better than simple baselines.
Generating Images from Captions with Attention
TLDR
It is demonstrated that the proposed model produces higher quality samples than other approaches and generates images with novel scene compositions corresponding to previously unseen captions in the dataset.
Video Pixel Networks
TLDR
A probabilistic video model, the Video Pixel Network (VPN), is proposed that estimates the discrete joint distribution of the raw pixel values in a video and generalizes to the motion of novel objects.
Unsupervised Learning of Video Representations using LSTMs
TLDR
This work uses Long Short Term Memory networks to learn representations of video sequences and evaluates the representations by finetuning them for a supervised learning problem - human action recognition on the UCF-101 and HMDB-51 datasets.
Temporal Generative Adversarial Nets
TLDR
A generative model is proposed that can learn a semantic representation of unlabelled videos and is capable of generating consistent videos; it can handle a wider range of applications, including the generation of long sequences, frame interpolation, and the use of pre-trained models.
Skip-Thought Vectors
We describe an approach for unsupervised learning of a generic, distributed sentence encoder. Using the continuity of text from books, we train an encoder-decoder model that tries to reconstruct the surrounding sentences of an encoded passage.
Show, Attend and Tell: Neural Image Caption Generation with Visual Attention
TLDR
An attention based model that automatically learns to describe the content of images is introduced that can be trained in a deterministic manner using standard backpropagation techniques and stochastically by maximizing a variational lower bound.
UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild
TLDR
This work introduces UCF101, which is currently the largest dataset of human actions, and provides baseline action recognition results on this new dataset using a standard bag-of-words approach with an overall performance of 44.5%.
DRAW: A Recurrent Neural Network For Image Generation
TLDR
The Deep Recurrent Attentive Writer neural network architecture for image generation substantially improves on the state of the art for generative models on MNIST, and, when trained on the Street View House Numbers dataset, it generates images that cannot be distinguished from real data with the naked eye.
Temporal Generative Adversarial Nets with Singular Value Clipping
TLDR
A generative model that can learn a semantic representation of unlabeled videos and is capable of generating videos is proposed, along with a novel method to train it stably in an end-to-end manner.