Deep Multimodal Feature Encoding for Video Ordering
@article{Sharma2020DeepMF,
  title   = {Deep Multimodal Feature Encoding for Video Ordering},
  author  = {Vivek Sharma and Makarand Tapaswi and Rainer Stiefelhagen},
  journal = {ArXiv},
  year    = {2020},
  volume  = {abs/2004.02205}
}
True understanding of a video comes from a joint analysis of all its modalities: the video frames, the audio track, and any accompanying text such as closed captions. We present a way to learn a compact multimodal feature representation that encodes all these modalities. Our model parameters are learned through a proxy task of inferring the temporal ordering of a set of unordered videos in a timeline. To this end, we create a new multimodal dataset for temporal ordering that consists of…
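The temporal-ordering proxy task can be illustrated with a minimal sketch (not the authors' implementation): given per-clip feature vectors and a hypothetical pairwise "comes-before" scorer, pick the permutation of clips that maximizes the summed pairwise scores.

```python
import itertools
import numpy as np

def order_clips(clip_features, pairwise_score):
    """Infer a timeline for unordered clips (illustrative sketch only).

    clip_features : list of 1-D feature vectors, one per clip.
    pairwise_score: f(a, b) -> higher when clip `a` should precede clip `b`.
    Returns the permutation (tuple of indices) that maximizes the summed
    pairwise "comes-before" scores over all candidate orderings.
    """
    n = len(clip_features)
    best_perm, best_score = None, -np.inf
    for perm in itertools.permutations(range(n)):
        score = sum(
            pairwise_score(clip_features[perm[i]], clip_features[perm[j]])
            for i in range(n) for j in range(i + 1, n)
        )
        if score > best_score:
            best_perm, best_score = perm, score
    return best_perm

# Toy example: 1-D "features" that happen to encode time directly,
# with a scorer that prefers increasing values.
feats = [np.array([3.0]), np.array([1.0]), np.array([2.0])]
predicted = order_clips(feats, lambda a, b: float(b[0] - a[0]))
```

In practice the pairwise scorer would be a learned network over the multimodal clip encodings; exhaustive permutation search is only feasible for small clip sets.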
8 Citations
Large Scale Holistic Video Understanding
- Computer Science, ECCV
- 2020
Introduces a new spatio-temporal deep neural network architecture, the "Holistic Appearance and Temporal Network" (HATNet), which fuses 2D and 3D architectures into one by combining intermediate representations of appearance and temporal cues.
Look Before you Speak: Visually Contextualized Utterances
- Computer Science, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
- 2021
Introduces a visually conditioned Future Utterance Prediction task, predicting the next utterance in a video using both visual frames and transcribed speech as context, together with a novel co-attentional multimodal video transformer that outperforms baselines using textual inputs alone.
Self-supervised Face Representation Learning
- Computer Science
- 2020
This thesis investigates fine-tuning deep face features in a self-supervised manner for discriminative face representation learning, wherein we develop methods to automatically generate pseudo-labels…
Test of Time: Instilling Video-Language Models with a Sense of Time
- Computer Science, ArXiv
- 2023
This paper proposes a temporal adaptation recipe on top of one video-language model, VideoCLIP, based on post-pretraining on a small amount of video-text data, and conducts a zero-shot evaluation of the adapted models on six datasets for three downstream tasks which require a varying degree of time awareness.
Vi2CLR: Video and Image for Visual Contrastive Learning of Representation
- Computer Science, 2021 IEEE/CVF International Conference on Computer Vision (ICCV)
- 2021
This paper shows how joint self-supervised visual clustering and instance-similarity learning with 2D (image) and 3D (video) ConvNet encoders yields robust representations with near-supervised performance.
Clustering based Contrastive Learning for Improving Face Representations
- Computer Science, 2020 15th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2020)
- 2020
This work presents Clustering-based Contrastive Learning (CCL), a new clustering-based representation learning approach that uses labels obtained from clustering along with video constraints to learn discriminative face features.
Short-time deep-learning based source separation for speech enhancement in reverberant environments with beamforming
- Computer Science
- 2020
Presents a short-time source separation method based on a temporal convolutional network combined with compact bilinear pooling; the results contradict the widely adopted practice of using multi-condition trained ASR systems and reinforce the use of speech enhancement methods for user profiling in HRI environments.
SplitNN-driven Vertical Partitioning
- Computer Science, ArXiv
- 2020
This work introduces SplitNN-driven Vertical Partitioning, a configuration of the distributed deep learning method SplitNN that facilitates learning from vertically distributed features and tackles the specific challenges posed by vertically split datasets.
References
Showing 1-10 of 69 references
Deep Temporal Linear Encoding Networks
- Computer Science, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
- 2017
A new video representation, temporal linear encoding (TLE), embedded inside CNNs as a new layer; it captures appearance and motion throughout entire videos and outperforms current state-of-the-art methods on both datasets.
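The core TLE computation can be sketched on plain feature vectors (the full method operates on CNN feature maps inside the network): aggregate per-segment features, apply a bilinear (outer-product) encoding, then signed square-root and L2 normalization.

```python
import numpy as np

def temporal_linear_encoding(segment_feats):
    """Minimal sketch of TLE on vector features (not the full CNN layer).

    segment_feats : (T, D) array, one feature vector per video segment.
    Aggregates the T segments by element-wise multiplication, applies a
    bilinear (outer-product) encoding, then signed square-root and
    L2 normalization, yielding a single D*D video descriptor.
    """
    agg = np.prod(segment_feats, axis=0)           # element-wise aggregation
    bilinear = np.outer(agg, agg).ravel()          # bilinear encoding
    signed_sqrt = np.sign(bilinear) * np.sqrt(np.abs(bilinear))
    return signed_sqrt / (np.linalg.norm(signed_sqrt) + 1e-12)

desc = temporal_linear_encoding(np.array([[1.0, 2.0], [3.0, 0.5]]))
```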
Slow and Steady Feature Analysis: Higher Order Temporal Coherence in Video
- Computer Science, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
- 2016
A convolutional neural network is trained with a regularizer on tuples of sequential frames from unlabeled video to generalize slow feature analysis to "steady" feature analysis and impose a prior that higher order derivatives in the learned feature space must be small.
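The "steady" prior amounts to penalizing second-order temporal differences of the learned features; a minimal sketch of such a regularizer (not the paper's contrastive formulation) is:

```python
import numpy as np

def steady_feature_loss(z):
    """Sketch of a higher-order temporal coherence regularizer.

    z : (T, D) features of T sequential frames from one video.
    Penalizes large second differences z[t+2] - 2*z[t+1] + z[t],
    encouraging feature trajectories whose direction of change
    stays steady over time (zero for a linear trajectory).
    """
    second_diff = z[2:] - 2 * z[1:-1] + z[:-2]
    return float(np.mean(np.sum(second_diff ** 2, axis=1)))

# A perfectly linear feature trajectory incurs zero penalty.
linear = np.outer(np.arange(5.0), np.ones(3))
loss = steady_feature_loss(linear)
```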
Learnable pooling with Context Gating for video classification
- Computer Science, ArXiv
- 2017
This work explores combinations of learnable pooling techniques such as Soft Bag-of-words, Fisher Vectors, NetVLAD, GRU and LSTM to aggregate video features over time, and introduces a learnable non-linear network unit, named Context Gating, aiming at modeling interdependencies between features.
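The Context Gating unit itself is compact: it recalibrates a feature vector x by an input-dependent sigmoid gate, y = sigmoid(W x + b) * x (element-wise). A minimal sketch:

```python
import numpy as np

def context_gating(x, W, b):
    """Context Gating: y = sigmoid(W @ x + b) * x (element-wise).

    Recalibrates each feature dimension with a learned, input-dependent
    gate in (0, 1), modeling interdependencies between activations.
    """
    gate = 1.0 / (1.0 + np.exp(-(W @ x + b)))
    return gate * x

rng = np.random.default_rng(0)
x = rng.standard_normal(4)
y = context_gating(x, rng.standard_normal((4, 4)), np.zeros(4))
```

Because every gate lies in (0, 1), the unit can only attenuate activations, never amplify them.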
Attention-Based Multimodal Fusion for Video Description
- Computer Science, 2017 IEEE International Conference on Computer Vision (ICCV)
- 2017
A multimodal attention model that can selectively utilize features from different modalities for each word in the output description is introduced that outperforms state-of-the-art methods on two standard datasets: YouTube2Text and MSR-VTT.
A Dataset and Exploration of Models for Understanding Video Data through Fill-in-the-Blank Question-Answering
- Computer Science, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
- 2017
This task is not solvable by a language model alone; while the model combining 2D and 3D visual information provides the best result, all models perform significantly worse than human level.
Shuffle and Learn: Unsupervised Learning Using Temporal Order Verification
- Computer Science, ECCV
- 2016
This paper formulates an approach for learning a visual representation from the raw spatiotemporal signals in videos using a Convolutional Neural Network, and shows that this method captures information that is temporally varying, such as human pose.
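The self-supervision signal in Shuffle and Learn is temporal order verification: sample frame tuples from a video, label correctly ordered tuples positive and shuffled ones negative, and train a network to tell them apart. A minimal sketch of the tuple sampling (the paper additionally filters tuples by motion magnitude):

```python
import random

def sample_order_verification_pair(num_frames, rng):
    """Sketch of Shuffle-and-Learn style tuple sampling.

    Draws three distinct frame indices from one video and returns a
    positive tuple in correct temporal order (label 1) plus a negative
    tuple with the first two frames swapped out of order (label 0).
    A verification network would classify tuples as ordered or not.
    """
    a, b, c = sorted(rng.sample(range(num_frames), 3))
    positive = ((a, b, c), 1)
    negative = ((b, a, c), 0)   # temporally inconsistent ordering
    return positive, negative

pos, neg = sample_order_verification_pair(30, random.Random(42))
```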
Describing Videos using Multi-modal Fusion
- Computer Science, ACM Multimedia
- 2016
Experimental results show the effectiveness of the multi-modal fusion encoder trained in an end-to-end framework, which achieved top performance in both common-metrics evaluation and human evaluation.
Weakly Supervised Representation Learning for Unsynchronized Audio-Visual Events
- Computer Science, CVPR Workshops
- 2018
A novel multimodal framework that instantiates multiple instance learning for audio-visual representation learning is proposed, and it is shown that the learnt representations are useful for classifying events and localizing their characteristic audio-visual elements.
See, Hear, and Read: Deep Aligned Representations
- Computer Science, ArXiv
- 2017
This work utilizes large amounts of readily available, synchronous data to learn deep discriminative representations shared across three major natural modalities: vision, sound and language, jointly training a deep convolutional network for aligned representation learning.
Two-Stream Convolutional Networks for Action Recognition in Videos
- Computer Science, NIPS
- 2014
This work proposes a two-stream ConvNet architecture which incorporates spatial and temporal networks and demonstrates that a ConvNet trained on multi-frame dense optical flow is able to achieve very good performance in spite of limited training data.
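Two-stream prediction combines the RGB (spatial) and optical-flow (temporal) streams by late fusion; one of the fusion schemes the paper evaluates is averaging the per-stream softmax scores, sketched here:

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D logit vector."""
    e = np.exp(logits - np.max(logits))
    return e / e.sum()

def two_stream_fusion(spatial_logits, temporal_logits):
    """Late fusion sketch for a two-stream action recognition network.

    The spatial stream sees RGB frames, the temporal stream sees stacked
    optical flow; class scores are fused by averaging the per-stream
    softmax distributions.
    """
    return 0.5 * (softmax(spatial_logits) + softmax(temporal_logits))

fused = two_stream_fusion(np.array([2.0, 0.5, 0.1]),
                          np.array([1.5, 1.0, 0.2]))
predicted_class = int(np.argmax(fused))
```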