Corpus ID: 214802821

Deep Multimodal Feature Encoding for Video Ordering

Vivek Sharma, Makarand Tapaswi, Rainer Stiefelhagen
True understanding of videos comes from a joint analysis of all of their modalities: the video frames, the audio track, and any accompanying text such as closed captions. We present a way to learn a compact multimodal feature representation that encodes all of these modalities. Our model parameters are learned through a proxy task of inferring the temporal ordering of a set of unordered videos on a timeline. To this end, we create a new multimodal dataset for temporal ordering that consists of… 
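The proxy task above can be illustrated with a minimal sketch: each clip gets a fused multimodal feature vector, a linear "time head" scores it, and sorting by score yields a predicted timeline evaluated by pairwise order accuracy. All names and the linear head are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

def fuse(frames, audio, text):
    """Toy multimodal fusion: concatenate per-modality feature vectors."""
    return np.concatenate([frames, audio, text])

def predict_order(clip_features, w):
    """Score each clip with a hypothetical linear time head and sort
    the scores to obtain a predicted timeline (clip indices)."""
    scores = clip_features @ w
    return np.argsort(scores)

def pairwise_order_accuracy(pred_order, n):
    """Fraction of clip pairs placed in the correct relative order,
    assuming the ground-truth order is 0, 1, ..., n-1."""
    pos = np.argsort(pred_order)  # position of each clip in the timeline
    correct = sum(pos[i] < pos[j] for i in range(n) for j in range(i + 1, n))
    return correct / (n * (n - 1) / 2)
```

In the real model the ordering is inferred jointly over the set rather than by independent per-clip scores; the sketch only conveys the shape of the objective.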


Large Scale Holistic Video Understanding

A new spatio-temporal deep neural network architecture, the "Holistic Appearance and Temporal Network" (HATNet), is introduced; it fuses 2D and 3D architectures into one by combining intermediate representations of appearance and temporal cues.

Look Before you Speak: Visually Contextualized Utterances

A new visually conditioned Future Utterance Prediction task is introduced that involves predicting the next utterance in a video using both visual frames and transcribed speech as context, together with a novel co-attentional multimodal video transformer that outperforms baselines using textual inputs alone.

Self-supervised Face Representation Learning

This thesis investigates fine-tuning deep face features in a self-supervised manner for discriminative face representation learning, developing methods to automatically generate pseudo-labels.

Test of Time: Instilling Video-Language Models with a Sense of Time

This paper proposes a temporal adaptation recipe on top of one video-language model, VideoCLIP, based on post-pretraining on a small amount of video-text data, and conducts a zero-shot evaluation of the adapted models on six datasets for three downstream tasks which require a varying degree of time awareness.

Vi2CLR: Video and Image for Visual Contrastive Learning of Representation

This paper shows how joint self-supervised visual clustering and instance similarity learning with 2D (image) and 3D (video) ConvNet encoders yields robust, near-supervised performance.

Clustering based Contrastive Learning for Improving Face Representations

This work presents Clustering-based Contrastive Learning (CCL), a new clustering-based representation learning approach that uses labels obtained from clustering along with video constraints to learn discriminative face features.

Short-time deep-learning based source separation for speech enhancement in reverberant environments with beamforming

A short-term source separation method based on a temporal convolutional network combined with compact bilinear pooling is presented; it yields a surprising result that contradicts the widely adopted practice of using multi-condition trained ASR systems and reinforces the use of speech enhancement methods for user profiling in HRI environments.

SplitNN-driven Vertical Partitioning

This work introduces SplitNN-driven Vertical Partitioning, a configuration of the distributed deep learning method SplitNN that facilitates learning from vertically distributed features and tackles the specific challenges posed by vertically split datasets.

Deep Temporal Linear Encoding Networks

A new video representation, called temporal linear encoding (TLE) and embedded inside of CNNs as a new layer, which captures the appearance and motion throughout entire videos, and outperforms current state-of-the-art methods on both datasets.
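The core idea of temporal linear encoding can be sketched in a few lines: per-segment features are aggregated across time with an elementwise operation and then passed through a bilinear (outer-product) encoding. This is a toy numpy sketch of the idea, not the trainable layer as implemented in the paper.

```python
import numpy as np

def temporal_linear_encoding(segments):
    """Simplified TLE: aggregate per-segment feature vectors with
    elementwise multiplication, then apply a bilinear (outer-product)
    encoding and flatten to a single video-level descriptor."""
    agg = segments[0]
    for s in segments[1:]:
        agg = agg * s                  # temporal aggregation across segments
    return np.outer(agg, agg).ravel()  # bilinear encoding
```

In practice the bilinear encoding is compacted (e.g. with compact bilinear pooling) to keep the descriptor dimensionality manageable.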

Slow and Steady Feature Analysis: Higher Order Temporal Coherence in Video

A convolutional neural network is trained with a regularizer on tuples of sequential frames from unlabeled video to generalize slow feature analysis to "steady" feature analysis, imposing a prior that higher-order derivatives in the learned feature space must be small.
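The "steady" prior can be written down directly: where slow feature analysis penalizes first temporal differences of the learned features, steady feature analysis additionally penalizes second differences, so feature changes themselves change slowly. A toy numpy version of the two penalties (illustrative, not the paper's contrastive training loss):

```python
import numpy as np

def slowness_loss(z):
    """First-order coherence: consecutive frame features should be close.
    z has shape (num_frames, feature_dim)."""
    d = np.diff(z, axis=0)
    return float(np.mean(np.sum(d ** 2, axis=1)))

def steadiness_loss(z):
    """Higher-order coherence: second temporal differences of the
    feature trajectory should be small, i.e. the trajectory is
    near-linear over short windows."""
    d2 = np.diff(z, n=2, axis=0)
    return float(np.mean(np.sum(d2 ** 2, axis=1)))
```

A feature trajectory that moves at constant velocity has nonzero slowness loss but zero steadiness loss, which is exactly the kind of smooth motion the higher-order prior permits.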

Learnable pooling with Context Gating for video classification

This work explores combinations of learnable pooling techniques such as Soft Bag-of-words, Fisher Vectors, NetVLAD, GRU and LSTM to aggregate video features over time, and introduces a learnable non-linear network unit, named Context Gating, aiming at modeling interdependencies between features.
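The Context Gating unit has a simple functional form: a sigmoid gate computed from the input itself recalibrates each feature, y = sigmoid(W x + b) * x. A minimal sketch with hypothetical, untrained parameters:

```python
import numpy as np

def context_gating(x, W, b):
    """Context Gating: an input-dependent elementwise gate,
    y = sigmoid(W @ x + b) * x, that re-weights feature activations
    based on the overall feature context."""
    gate = 1.0 / (1.0 + np.exp(-(W @ x + b)))
    return gate * x
```

With W and b at zero every gate is 0.5, halving each activation; training shapes the gates so that informative features pass through while others are suppressed.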

Attention-Based Multimodal Fusion for Video Description

A multimodal attention model that can selectively utilize features from different modalities for each word in the output description is introduced that outperforms state-of-the-art methods on two standard datasets: YouTube2Text and MSR-VTT.

A Dataset and Exploration of Models for Understanding Video Data through Fill-in-the-Blank Question-Answering

This task is not solvable by a language model alone; the model combining 2D and 3D visual information provides the best result, yet all models perform significantly worse than human level.

Shuffle and Learn: Unsupervised Learning Using Temporal Order Verification

This paper presents an approach for learning a visual representation from the raw spatiotemporal signals in videos using a convolutional neural network, and shows that the method captures information that is temporally varying, such as human pose.
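The self-supervision signal here is binary order verification over frame triplets: positives keep temporal order, negatives do not, and a network learns to tell them apart. A simplified stand-in for the paper's sampling scheme (the original draws negatives from outside the positive's temporal window):

```python
import random

def sample_order_tuple(num_frames, positive, rng):
    """Sample a frame-index triplet for order verification: positives
    keep temporal order; negatives swap the first two indices so the
    middle frame violates it."""
    a, b, c = sorted(rng.sample(range(num_frames), 3))
    if positive:
        return (a, b, c), 1
    return (b, a, c), 0  # middle frame now out of order

def is_ordered(triplet):
    """The binary order-verification target a network learns to predict."""
    a, b, c = triplet
    return int(a < b < c)
```

Training then feeds the frames at these indices through a shared ConvNet and classifies the concatenated features as ordered or shuffled.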

Describing Videos using Multi-modal Fusion

Experimental results show the effectiveness of a multi-modal fusion encoder trained in an end-to-end framework, which achieved top performance in both common-metric evaluation and human evaluation.

Weakly Supervised Representation Learning for Unsynchronized Audio-Visual Events

A novel multimodal framework that instantiates multiple instance learning for audio-visual representation learning is proposed, and it is shown that the learnt representations are useful for classifying events and localizing their characteristic audio-visual elements.

See, Hear, and Read: Deep Aligned Representations

This work utilizes large amounts of readily available, synchronous data to learn deep discriminative representations shared across three major natural modalities: vision, sound, and language, jointly training a deep convolutional network for aligned representation learning.

Two-Stream Convolutional Networks for Action Recognition in Videos

This work proposes a two-stream ConvNet architecture which incorporates spatial and temporal networks and demonstrates that a ConvNet trained on multi-frame dense optical flow is able to achieve very good performance in spite of limited training data.
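The two-stream design combines predictions from an appearance (RGB) network and a motion (optical-flow) network. A toy late-fusion sketch; the equal weighting is an illustrative choice, and the original work also reports SVM-based fusion of the two streams:

```python
import numpy as np

def two_stream_prediction(spatial_scores, temporal_scores, w=0.5):
    """Toy late fusion: weighted average of per-class scores from the
    spatial (appearance) and temporal (motion) streams, then argmax."""
    fused = w * spatial_scores + (1 - w) * temporal_scores
    return int(np.argmax(fused))
```

The point of the architecture is that the motion stream, fed with stacked optical flow, contributes complementary evidence that raw-frame models miss, so the fused score beats either stream alone.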