Learning by Aligning Videos in Time

Sanjay Haresh, Sateesh Kumar, Huseyin Coskun, Shahram Najam Syed, Andrey Konin, M. Zeeshan Zia, and Quoc-Huy Tran. In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
We present a self-supervised approach for learning video representations using temporal video alignment as a pretext task, while exploiting both frame-level and video-level information. We leverage a novel combination of temporal alignment loss and temporal regularization terms, which can be used as supervision signals for training an encoder network. Specifically, the temporal alignment loss (i.e., Soft-DTW) aims for the minimum cost for temporally aligning videos in the embedding space… 
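The Soft-DTW alignment cost mentioned above replaces the hard minimum in the classic DTW recurrence with a smoothed soft-minimum, which makes the alignment cost differentiable and hence usable as a training loss. A minimal NumPy sketch of the recurrence (function names and the squared-Euclidean frame cost are illustrative assumptions, not the authors' implementation):

```python
import numpy as np

def soft_min(values, gamma):
    """Differentiable soft-minimum: -gamma * log(sum(exp(-v / gamma)))."""
    values = np.asarray(values, dtype=float)
    m = values.min()  # shift for numerical stability
    return m - gamma * np.log(np.exp(-(values - m) / gamma).sum())

def soft_dtw(x, y, gamma=0.1):
    """Soft-DTW alignment cost between two frame-embedding sequences.

    x: (n, d) array, y: (m, d) array. Returns a soft minimum over all
    monotonic alignments of the pairwise squared-Euclidean frame costs.
    """
    n, m = len(x), len(y)
    # pairwise squared-Euclidean cost matrix between frames
    cost = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
    R = np.full((n + 1, m + 1), np.inf)
    R[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # soft version of the DTW recurrence over the three moves
            R[i, j] = cost[i - 1, j - 1] + soft_min(
                [R[i - 1, j], R[i, j - 1], R[i - 1, j - 1]], gamma)
    return R[n, m]
```

As gamma approaches 0 this recovers the hard DTW cost; larger gamma smooths the loss surface at the price of a looser alignment.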
Frame-wise Action Representations for Long Videos via Sequence Contrastive Learning
This paper introduces a novel contrastive action representation learning (CARL) framework to learn frame-wise action representations, especially for long videos, in a self-supervised manner and outperforms previous state-of-the-art methods on video alignment and frame retrieval tasks.
Video-Text Representation Learning via Differentiable Weak Temporal Alignment
This paper proposes a novel multi-modal self-supervised framework, VT-TWINS, to capture significant information from noisy and weakly correlated data using a variant of Dynamic Time Warping (DTW), and applies a contrastive learning scheme to learn feature representations on weakly correlated data.
Learning to Align Sequential Actions in the Wild
This paper proposes an approach to enforce temporal priors on the optimal transport matrix, which leverages temporal consistency, while allowing for variations in the order of actions.
Context-Aware Sequence Alignment using 4D Skeletal Augmentation
This work proposes a novel context-aware self-supervised learning architecture that employs self-attention and cross-attention mechanisms to incorporate the spatial and temporal context of human actions, which can solve the temporal discontinuity problem.
Segregational Soft Dynamic Time Warping and Its Application to Action Prediction
The superiority of the proposed algorithms lies in combining the soft-minimum operator and the relaxed boundary constraints of S-DTW with the segregational capabilities of OE-DTW and OBE-DTW, resulting in better and differentiable action alignment in the case of continuous, unsegmented videos.
Action Prediction During Human-Object Interaction Based on DTW and Early Fusion of Human and Object Representations
The experimental results reveal the importance of fusing human- and object-centered action representations for accurate action prediction and demonstrate that the proposed approach achieves significantly higher action prediction accuracy than competitive methods.
Weakly-Supervised Online Action Segmentation in Multi-View Instructional Videos
A framework is presented to segment streaming videos online at test time using dynamic programming, showing its advantages over a greedy sliding-window approach; the framework is further improved by introducing the Online-Offline Discrepancy Loss (OODL) to encourage segmentation results with higher temporal consistency.
Learning ABCs: Approximate Bijective Correspondence for isolating factors of variation with weak supervision
A novel algorithm is proposed that utilizes a weak form of supervision in which the data is partitioned into sets according to certain inactive factors of variation that are invariant across the elements of each set.
SeqMatchNet: Contrastive Learning with Sequence Matching for Place Recognition & Relocalization
For the first time, this work bridges the gap between single-image representation learning and sequence matching through SeqMatchNet, which transforms the single-image descriptors so that they become more responsive to the sequence matching metric.
Unsupervised Activity Segmentation by Joint Representation Learning and Online Clustering
This work presents a novel approach for unsupervised activity segmentation which uses video frame clustering as a pretext task and simultaneously performs representation learning and online clustering and leverages temporal information in videos by employing temporal optimal transport.
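The temporal optimal transport mentioned here builds on entropy-regularized optimal transport, which is commonly solved with Sinkhorn iterations. A minimal sketch of plain Sinkhorn (the temporal prior that biases the frame-to-cluster coupling toward the diagonal is omitted; names and parameters are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def sinkhorn(cost, r, c, eps=0.1, n_iters=200):
    """Entropy-regularized optimal transport via Sinkhorn iterations.

    cost: (n, m) cost matrix; r, c: row and column marginals.
    Returns a transport matrix T whose row sums match r and whose
    column sums approach c as the iterations converge.
    """
    K = np.exp(-cost / eps)      # Gibbs kernel
    u = np.ones_like(r)
    for _ in range(n_iters):
        v = c / (K.T @ u)        # scale columns to match c
        u = r / (K @ v)          # scale rows to match r
    return u[:, None] * K * v[None, :]
```

Smaller eps yields a sharper (closer to unregularized) coupling but needs more iterations to converge.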
Few-Shot Video Classification via Temporal Alignment
This paper proposes the Ordered Temporal Alignment Module (OTAM), a novel few-shot learning framework that can learn to classify a previously unseen video, and demonstrates that the model leads to significant improvement of few-shot video classification over a wide range of competitive baselines and outperforms state-of-the-art benchmarks by a large margin.
Temporal Cycle-Consistency Learning
It is shown that the learned embeddings enable few-shot classification of these action phases, significantly reducing the supervised training requirements; and TCC is complementary to other methods of self-supervised learning in videos, such as Shuffle and Learn and Time-Contrastive Networks.
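The intuition behind temporal cycle-consistency (TCC) can be sketched with hard nearest neighbors: a frame is cycle-consistent if mapping it to its nearest neighbor in the other video's embedding space, and then back, returns the original frame. TCC itself trains with a soft, differentiable relaxation of this check; the sketch below (illustrative names, not the authors' code) shows only the hard version:

```python
import numpy as np

def cycle_consistent(u_embs, v_embs):
    """Check hard cycle-consistency of each frame in u_embs through v_embs.

    Frame i maps to its nearest neighbor j in v_embs; i is
    cycle-consistent if the nearest neighbor of j back in u_embs is i.
    """
    # pairwise squared Euclidean distances between the two sequences
    d = ((u_embs[:, None, :] - v_embs[None, :, :]) ** 2).sum(-1)
    fwd = d.argmin(axis=1)   # u -> v nearest neighbors
    back = d.argmin(axis=0)  # v -> u nearest neighbors
    return np.array([back[fwd[i]] == i for i in range(len(u_embs))])
```

Training then amounts to maximizing the fraction of cycle-consistent frames, via the soft relaxation rather than this non-differentiable argmin.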
Shuffle and Attend: Video Domain Adaptation
This work proposes an attention mechanism that focuses on the more discriminative clips and directly optimizes for video-level alignment; it further uses clip order prediction as an auxiliary task, encouraging the learning of representations that focus on the humans and objects involved in the actions.
Aligning Videos in Space and Time
The proposed novel alignment procedure can successfully learn to correspond semantically similar patches across videos, and learns representations that are sensitive to object and action states.
Video Representation Learning by Dense Predictive Coding
With single stream (RGB only), DPC pretrained representations achieve state-of-the-art self-supervised performance on both UCF101 and HMDB51, outperforming all previous learning methods by a significant margin, and approaching the performance of a baseline pre-trained on ImageNet.
Shuffle and Learn: Unsupervised Learning Using Temporal Order Verification
This paper formulates an approach for learning a visual representation from the raw spatiotemporal signals in videos using a Convolutional Neural Network, and shows that this method captures information that is temporally varying, such as human pose.
DynamoNet: Dynamic Action and Motion Network
A novel unified spatio-temporal 3D-CNN architecture (DynamoNet) is introduced that jointly optimizes video classification and motion representation learning by predicting future frames as a multi-task learning problem.
Self-Supervised Video Representation Learning with Space-Time Cubic Puzzles
This paper introduces a new self-supervised task called Space-Time Cubic Puzzles, which requires a network to arrange permuted 3D spatio-temporal crops and thereby learns both the spatial appearance and the temporal relations of video frames.
Self-Supervised Spatiotemporal Learning via Video Clip Order Prediction
A self-supervised spatiotemporal learning technique is presented that leverages the chronological order of videos, learning a spatiotemporal representation of the video by predicting the order of shuffled clips from the video.
D3TW: Discriminative Differentiable Dynamic Time Warping for Weakly Supervised Action Alignment and Segmentation
The proposed Discriminative Differentiable Dynamic Time Warping (D3TW) innovatively solves sequence alignment with discriminative modeling and end-to-end training, which substantially improves the performance in weakly supervised action alignment and segmentation tasks.