Corpus ID: 11299888

Disentangling Space and Time in Video with Hierarchical Variational Auto-encoders

@article{Grathwohl2016DisentanglingSA,
  title={Disentangling Space and Time in Video with Hierarchical Variational Auto-encoders},
  author={Will Grathwohl and Aaron Wilson},
  journal={ArXiv},
  year={2016},
  volume={abs/1612.04440}
}
There are many forms of feature information present in video data. [...] Our approach leverages a deep generative model with a factored prior distribution that encodes properties of temporal invariances in the hidden feature set. Learning is achieved via variational inference. We present results of learning identity and pose information on a dataset of moving characters as well as a dataset of rotating 3D objects. Our experimental results demonstrate our model's success in factoring its…
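To make the factored prior concrete, here is a minimal, hypothetical sketch (not the authors' architecture; the module names, sizes, and the mean-pooling of the identity latent are all assumptions): one time-invariant identity latent is shared across a clip, per-frame pose latents vary freely, and both are trained with a standard ELBO against a factored standard-normal prior.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FactoredVideoVAE(nn.Module):
    """Hypothetical sketch: z_id is shared across all frames of a clip
    (temporally invariant), z_pose is sampled per frame."""
    def __init__(self, frame_dim=1024, id_dim=16, pose_dim=8):
        super().__init__()
        self.enc = nn.Linear(frame_dim, 256)
        self.id_head = nn.Linear(256, 2 * id_dim)      # mu, logvar for identity
        self.pose_head = nn.Linear(256, 2 * pose_dim)  # mu, logvar per frame
        self.dec = nn.Sequential(
            nn.Linear(id_dim + pose_dim, 256), nn.ReLU(),
            nn.Linear(256, frame_dim))

    @staticmethod
    def sample(mu, logvar):
        # Reparameterized Gaussian sample.
        return mu + torch.randn_like(mu) * (0.5 * logvar).exp()

    def forward(self, x):  # x: (batch, time, frame_dim)
        h = F.relu(self.enc(x))
        # Identity: pool over time so the latent cannot vary within a clip.
        id_mu, id_lv = self.id_head(h.mean(dim=1)).chunk(2, dim=-1)
        # Pose: one latent per frame.
        pose_mu, pose_lv = self.pose_head(h).chunk(2, dim=-1)
        z_id = self.sample(id_mu, id_lv).unsqueeze(1).expand(-1, x.size(1), -1)
        z_pose = self.sample(pose_mu, pose_lv)
        recon = self.dec(torch.cat([z_id, z_pose], dim=-1))
        # ELBO: reconstruction plus KL to a factored standard-normal prior.
        kl = lambda mu, lv: -0.5 * (1 + lv - mu.pow(2) - lv.exp()).sum(-1)
        return (F.mse_loss(recon, x, reduction='none').sum(-1).mean()
                + kl(id_mu, id_lv).mean()
                + kl(pose_mu, pose_lv).sum(1).mean())
```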

Citations

Disentangling Representations using Gaussian Processes in Variational Autoencoders for Video Prediction
TLDR: The experiments show quantitatively that the combination of the improved disentangled representations with the novel loss function enables MGP-VAE to outperform the state-of-the-art in video prediction.
Unsupervised Learning of Disentangled Representations from Video
We present a new model, DrNet, that learns disentangled image representations from video. Our approach leverages the temporal coherence of video and a novel adversarial loss to learn a representation…
Adversarial Disentanglement with Grouped Observations
TLDR: The training objective is augmented to minimize an appropriately defined mutual information term in an adversarial way; the resulting method can efficiently separate content- and style-related attributes and generalizes to unseen data.
Isolating Sources of Disentanglement in VAEs
TLDR: A decomposition of the variational lower bound is shown that can be used to explain the success of the β-VAE in learning disentangled representations, and a new information-theoretic disentanglement metric is proposed that is classifier-free and generalizes to arbitrarily distributed and non-scalar latent variables.
Isolating Sources of Disentanglement in Variational Autoencoders
We decompose the evidence lower bound to show the existence of a term measuring the total correlation between latent variables. We use this to motivate our β-TCVAE (Total Correlation…
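For reference, the decomposition both of these entries describe, as stated in the β-TCVAE paper (q(z) denotes the aggregate posterior; the braces label the three terms):

```latex
% The averaged KL term of the ELBO splits into index-code mutual
% information, total correlation, and dimension-wise KL.
\mathbb{E}_{p(x)}\!\left[\mathrm{KL}\big(q(z \mid x)\,\|\,p(z)\big)\right]
  = \underbrace{I_q(x; z)}_{\text{index-code MI}}
  + \underbrace{\mathrm{KL}\Big(q(z)\,\Big\|\,\prod\nolimits_j q(z_j)\Big)}_{\text{total correlation}}
  + \underbrace{\sum\nolimits_j \mathrm{KL}\big(q(z_j)\,\|\,p(z_j)\big)}_{\text{dimension-wise KL}}
```

β-TCVAE reweights only the total-correlation term with a coefficient β, which is what pushes the latent dimensions toward statistical independence.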
Topographic VAEs learn Equivariant Capsules
TLDR: The Topographic VAE, a novel method for efficiently training deep generative models with topographically organized latent variables, is introduced, and such a model is shown to organize its activations according to salient characteristics such as digit class, width, and style on MNIST.
Variational encoding of complex dynamics
TLDR: A time-lagged VAE, or variational dynamics encoder (VDE), is used to reduce complex, nonlinear processes to a single embedding with high fidelity to the underlying dynamics, and the VDE is shown to capture nontrivial dynamics in a variety of examples.
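A minimal sketch of the time-lagged idea (the helper names are hypothetical, and the published VDE also includes an autocorrelation regularizer omitted here): the latent encoded from frame x_t must reconstruct the lagged frame x_{t+τ}, so only information that persists over the lag survives in the embedding.

```python
import torch
import torch.nn.functional as F

def vde_step(encoder, decoder, x_t, x_lag):
    """One time-lagged VAE step: encode x_t, reconstruct x_{t+tau}."""
    mu, logvar = encoder(x_t).chunk(2, dim=-1)
    z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterize
    kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1).mean()
    return F.mse_loss(decoder(z), x_lag) + kl
```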
Disentangling Video with Independent Prediction
We propose an unsupervised variational model for disentangling video into independent factors, i.e., each factor's future can be predicted from its past without considering the others. We show that…
DATUM: Dotted Attention Temporal Upscaling Method
Computational simulations frequently save only a subset of their time slices, e.g., running for one thousand cycles but saving only fifty time slices. With this work we consider the problem of…
Estimating Predictive Rate–Distortion Curves via Neural Variational Inference
TLDR: Neural Predictive Rate–Distortion (NPRD) is introduced, an estimation method that scales to processes such as natural language, leveraging the universal approximation capabilities of neural networks and improving on bounds provided by clustering sequences.

References

Showing 1–10 of 17 references
DL-SFA: Deeply-Learned Slow Feature Analysis for Action Recognition
TLDR: This paper uses a two-layered SFA learning structure with 3D convolution and max-pooling operations to scale the method up to large inputs and capture abstract and structural features from the video.
Unsupervised Learning of Video Representations using LSTMs
TLDR: This work uses Long Short-Term Memory networks to learn representations of video sequences and evaluates them by fine-tuning for a supervised learning problem: human action recognition on the UCF-101 and HMDB-51 datasets.
Early Visual Concept Learning with Unsupervised Deep Learning
TLDR: An unsupervised approach is proposed for learning disentangled representations of the underlying factors of variation by applying the same learning pressures that have been suggested to act in the ventral visual stream in the brain.
Deep Convolutional Inverse Graphics Network
This paper presents the Deep Convolutional Inverse Graphics Network (DC-IGN), a model that aims to learn an interpretable representation of images, disentangled with respect to three-dimensional scene…
Learning Spatiotemporal Features with 3D Convolutional Networks
TLDR: The learned features, namely C3D (Convolutional 3D), with a simple linear classifier outperform state-of-the-art methods on 4 different benchmarks and are comparable with the current best methods on the other 2 benchmarks.
Auto-Encoding Variational Bayes
TLDR: A stochastic variational inference and learning algorithm is introduced that scales to large datasets and, under some mild differentiability conditions, even works in the intractable case.
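The paper's central device, shown in isolation (a minimal sketch with a stand-in loss): reparameterizing the sample as z = μ + σ·ε with ε ~ N(0, I) moves the randomness outside the parameters, so a Monte Carlo estimate of the lower bound can be differentiated end to end.

```python
import torch

mu = torch.zeros(8, requires_grad=True)
log_sigma = torch.zeros(8, requires_grad=True)

# Reparameterized sample: gradients flow to mu and log_sigma through z.
eps = torch.randn(8)
z = mu + log_sigma.exp() * eps

# Any downstream differentiable loss (a stand-in here) now yields
# low-variance pathwise gradients for the variational parameters.
loss = (z ** 2).sum()
loss.backward()
print(mu.grad, log_sigma.grad)
```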
How transferable are features in deep neural networks?
TLDR: This paper quantifies the generality versus specificity of neurons in each layer of a deep convolutional neural network and reports a few surprising results, including that initializing a network with transferred features from almost any number of layers can produce a boost to generalization that lingers even after fine-tuning to the target dataset.
WaveNet: A Generative Model for Raw Audio
TLDR: WaveNet, a deep neural network for generating raw audio waveforms, is introduced; it can be trained efficiently on data with tens of thousands of samples per second of audio and can be employed as a discriminative model, returning promising results for phoneme recognition.
Slow Feature Analysis: Unsupervised Learning of Invariances
TLDR: Slow feature analysis (SFA) is a new method for learning invariant or slowly varying features from a vectorial input signal; it is guaranteed to find the optimal solution within a family of functions directly and can learn to extract a large number of decorrelated features, which are ordered by their degree of invariance.
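A minimal linear SFA sketch in NumPy (illustrative only, not the reference implementation): whiten the signal, then keep the projections along which the finite-difference derivative has the smallest variance, i.e. the slowest ones.

```python
import numpy as np

def sfa(x, n_features=2):
    """Linear slow feature analysis on a signal x of shape (T, D)."""
    x = x - x.mean(axis=0)
    # Whiten: rotate and rescale so the input covariance is the identity.
    d, E = np.linalg.eigh(np.cov(x, rowvar=False))
    white = x @ E / np.sqrt(d + 1e-9)
    # Eigendecompose the covariance of the temporal differences;
    # eigh returns ascending eigenvalues, so the first columns are slowest.
    dd, W = np.linalg.eigh(np.cov(np.diff(white, axis=0), rowvar=False))
    return white @ W[:, :n_features]

# Toy usage: a slow sine mixed with a fast one is recovered first.
t = np.linspace(0, 2 * np.pi, 500)
sources = np.stack([np.sin(t), np.sin(20 * t)], axis=1)
mixed = sources @ np.random.randn(2, 2)
slow = sfa(mixed, n_features=1)
```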
Deconvolutional networks
TLDR: This work presents a learning framework, based on the convolutional decomposition of images under a sparsity constraint, in which features that capture mid-level cues emerge spontaneously from image data; the method is totally unsupervised.