Learning Temporal Embeddings for Complex Video Analysis

Vignesh Ramanathan, Kevin D. Tang, Greg Mori, Li Fei-Fei
2015 IEEE International Conference on Computer Vision (ICCV)

In this paper, we propose to learn temporal embeddings of video frames for complex video analysis. Large quantities of unlabeled video data can be easily obtained from the Internet. These videos possess the implicit weak label that they are sequences of temporally and semantically coherent images. We leverage this information to learn temporal embeddings for video frames by associating frames with the temporal context that they appear in. To do this, we propose a scheme for incorporating… 
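
The abstract above is truncated, so the authors' actual scheme is not shown here. As a rough illustration only of what "associating frames with the temporal context that they appear in" could look like, the sketch below scores a frame embedding against the mean of its neighbouring frames' embeddings and applies a hinge ranking loss against negative frames. The function names, the dot-product score, the mean-pooled context, and the margin value are all assumptions made for this example, not details taken from the paper.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def context_ranking_loss(frame, context, negatives, margin=1.0):
    """Hinge ranking loss (hypothetical formulation).

    frame     -- embedding of the target frame
    context   -- embeddings of temporally neighbouring frames
    negatives -- embeddings of frames sampled from elsewhere
    The loss is zero when the target frame scores at least `margin`
    higher against its context than every negative frame does.
    """
    ctx = mean_vector(context)
    positive_score = dot(frame, ctx)
    return sum(
        max(0.0, margin - positive_score + dot(neg, ctx))
        for neg in negatives
    )


# A frame aligned with its context and an orthogonal negative: no penalty.
loss_easy = context_ranking_loss([1.0, 0.0], [[1.0, 0.0], [1.0, 0.0]], [[0.0, 1.0]])

# A negative identical to the target frame violates the margin fully.
loss_hard = context_ranking_loss([1.0, 0.0], [[1.0, 0.0], [1.0, 0.0]], [[1.0, 0.0]])
```

In a real training setup the embeddings would come from a learned network and the loss would be minimized by gradient descent; here the vectors are fixed so that the ranking objective itself is easy to inspect.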

Learning Deep Intrinsic Video Representation by Exploring Temporal Coherence and Graph Structure

This work proposes a triplet sampling mechanism to encode the local temporal relationship of adjacent frames based on their deep representations, and incorporates the graph structure of the video as a prior to holistically preserve the inherent correlations among video frames.

Unsupervised Learning of Action Classes With Continuous Temporal Embedding

This work uses a continuous temporal embedding of framewise features to benefit from the sequential nature of activities and identifies clusters of temporal segments across all videos that correspond to semantically meaningful action classes.

Generating Videos with Scene Dynamics

This work proposes a generative adversarial network for video with a spatio-temporal convolutional architecture that untangles the scene's foreground from the background. It can generate tiny videos up to a second long at full frame rate, better than simple baselines.

Learning Temporal Regularity in Video Sequences

This work proposes two methods built upon autoencoders, chosen for their ability to work with little to no supervision, and builds a fully convolutional feed-forward autoencoder to learn both the local features and the classifiers in an end-to-end learning framework.

VizObj2Vec: Contextual Representation Learning for Visual Objects in Video-frames

A. Farhan, M. Hossain
2020 IEEE International Conference on Big Data (Big Data)

A distributed representation model, vizObj2Vec, is presented that leverages the contexts of visual objects learned from spatiotemporal placements of the objects in the video-frames to construct object-embeddings and enables computation of contextual similarity.

Connectionist Temporal Modeling for Weakly Supervised Action Labeling

The Extended Connectionist Temporal Classification (ECTC) framework is introduced to efficiently evaluate all possible alignments via dynamic programming and explicitly enforce their consistency with frame-to-frame visual similarities.

Discovering Latent Discriminative Patterns for Multi-Mode Event Representation

This paper proposes a compact event representation method that can concisely describe the inner modes of events, and adopts its event representation for representative event parts mining, which can highlight the visual topics of events and remarkably prune the raw videos.

Object-Centric Representation Learning from Unlabeled Videos

This work introduces a novel object-centric approach to temporal coherence that encourages similar representations to be learned for object-like regions segmented from nearby frames in a deep convolutional neural network representation.

STAT: Spatial-Temporal Attention Mechanism for Video Captioning

The proposed spatial-temporal attention mechanism (STAT), used within an encoder-decoder neural network for video captioning, takes into account both the spatial and temporal structures in a video, enabling the decoder to automatically select the significant regions in the most relevant temporal segments for word prediction.

Semantic Video Trailers

This paper proposes an unsupervised label propagation approach for query-based video summarization that effectively captures the multimodal semantics of queries and videos using state-of-the-art deep neural networks and creates a summary that is both semantically coherent and visually attractive.



Video (language) modeling: a baseline for generative models of natural videos

For the first time, it is shown that a strong baseline model for unsupervised feature learning using video data can predict non-trivial motions over short video sequences.

Unsupervised Learning of Video Representations using LSTMs

This work uses Long Short Term Memory networks to learn representations of video sequences and evaluates the representations by finetuning them for a supervised learning problem: human action recognition on the UCF-101 and HMDB-51 datasets.

A discriminative CNN video representation for event detection

This paper proposes using a set of latent concept descriptors as the frame descriptor, which enriches visual information while keeping it computationally affordable, and achieves new state-of-the-art performance in event detection on the largest video datasets.

Learning realistic human actions from movies

A new method for video classification that builds upon and extends several recent ideas, including local space-time features, space-time pyramids, and multi-channel non-linear SVMs, is presented and shown to improve state-of-the-art results on the standard KTH action dataset.

Convolutional Learning of Spatio-temporal Features

A model that learns latent representations of image sequences from pairs of successive images is introduced, allowing it to scale to realistic image sizes whilst using a compact parametrization.

Large-Scale Video Classification with Convolutional Neural Networks

This work studies multiple approaches for extending the connectivity of a CNN in time domain to take advantage of local spatio-temporal information and suggests a multiresolution, foveated architecture as a promising way of speeding up the training.

Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models

This work introduces the structure-content neural language model, which disentangles the structure of a sentence from its content, conditioned on representations produced by the encoder, and shows that with linear encoders, the learned embedding space captures multimodal regularities in terms of vector space arithmetic.

Deep Fragment Embeddings for Bidirectional Image Sentence Mapping

This work introduces a model for bidirectional retrieval of images and sentences through a deep, multi-modal embedding of visual and natural language data and introduces a structured max-margin objective that allows this model to explicitly associate fragments across modalities.

Video Event Understanding Using Natural Language Descriptions

A topic-based semantic relatedness measure is introduced between a video description and an action and role label, and incorporated into a posterior regularization objective that matches the state-of-the-art method on the TRECVID-MED11 event kit, despite weaker supervision.

Action Recognition with Stacked Fisher Vectors

Experimental results demonstrate the effectiveness of stacked Fisher vectors (SFV), and the combination of the traditional Fisher vector (FV) and SFV outperforms state-of-the-art methods on these datasets by a large margin.