• Corpus ID: 237347000

Drop-DTW: Aligning Common Signal Between Sequences While Dropping Outliers

  Nikita Dvornik, Isma Hadji, Konstantinos G. Derpanis, Animesh Garg, Allan D. Jepson
In this work, we consider the problem of sequence-to-sequence alignment for signals containing outliers. Assuming the absence of outliers, the standard Dynamic Time Warping (DTW) algorithm efficiently computes the optimal alignment between two (generally) variable-length sequences. While DTW is robust to temporal shifts and dilations of the signal, it fails to align sequences in a meaningful way in the presence of outliers that can be arbitrarily interspersed in the sequences. To address this… 
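The standard DTW recursion the abstract refers to can be sketched as a small dynamic program. The function name, scalar distance, and sequences below are illustrative, not taken from the paper:

```python
import numpy as np

def dtw(x, y, dist=lambda a, b: abs(a - b)):
    """Classic DTW: an O(n*m) dynamic program over pairwise costs.

    Returns the minimal cumulative cost of aligning x and y, where every
    element of each sequence must be matched (no element can be dropped,
    which is exactly the limitation Drop-DTW targets for outliers).
    """
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = dist(x[i - 1], y[j - 1])
            # extend the warping path by a match, expansion, or contraction
            D[i, j] = cost + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    return D[n, m]
```

Because every element must lie on the warping path, a single arbitrarily bad outlier inflates the total cost, e.g. `dtw([1, 2, 3], [1, 100, 2, 3])` is forced to pay for matching `100` somewhere.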

Figures and Tables from this paper

Video-Text Representation Learning via Differentiable Weak Temporal Alignment

This paper proposes a novel multi-modal self-supervised framework Video-Text Temporally Weak Alignment-based Contrastive Learning (VT-TWINS) to capture significant information from noisy and weakly correlated data using a variant of Dynamic Time Warping (DTW).

Temporal Alignment Networks for Long-term Video

A temporal alignment network is proposed that ingests long-term video sequences and associated text sentences, determines whether a sentence is alignable with the video, and, if it is alignable, determines its alignment.

Segregational Soft Dynamic Time Warping and Its Application to Action Prediction

The superiority of the proposed algorithms lies in the combination of the soft-minimum operator and the relaxation of the boundary constraints of S-DTW, with the segregational capabilities of OE-DTW and OBE-DTW, resulting in better and differentiable action alignment in the case of continuous, unsegmented videos.

Set-Supervised Action Learning in Procedural Task Videos via Pairwise Order Consistency

  • Zijia Lu, Ehsan Elhamifar
  • Computer Science
    2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
  • 2022
An attention-based method with a new Pairwise Ordering Consistency (POC) loss that encourages the attentions of each common action pair in two videos of the same task to follow a similar ordering, which significantly improves the state of the art.

Fine-grained Temporal Contrastive Learning for Weakly-supervised Temporal Action Localization

It is argued that learning by contextually comparing sequence-to-sequence distinctions offers an essential inductive bias in WSAL and helps identify coherent action instances and achieves state-of-the-art performance on two popular benchmarks.

Graph2Vid: Flow Graph to Video Grounding for Weakly-supervised Multi-Step Localization

A new algorithm is proposed - Graph2Vid - that infers the actual ordering of steps in the video and simultaneously localizes them, and is both more efficient than the baselines and yields strong step localization results, without the need for step order annotation.

Semi-Weakly-Supervised Learning of Complex Actions from Instructional Task Videos

A Soft Restricted Edit (SRE) loss is developed to encourage small variations between the predicted transcripts of unlabeled videos and ground-truth transcripts of the weakly-labeled videos of the same task.

Learning Spatial-Preserved Skeleton Representations for Few-Shot Action Recognition

This work proposes a novel spatial matching strategy consisting of spatial disentanglement and spatial activation that can be effectively inserted into existing temporal alignment frameworks, achieving considerable performance improvements as well as inherent explainability.

P3IV: Probabilistic Procedure Planning from Instructional Videos with Weak Supervision

This work removes the need for expensive temporal video annotations and proposes a weakly supervised approach by learning from natural language instructions, based on a transformer equipped with a memory module, which maps the start and goal observations to a sequence of plausible actions.

Data-Driven Oracle Bone Rejoining: A Dataset and Practical Self-Supervised Learning Scheme

This work collects a real-world dataset for rejoining Oracle Bone fragments, namely OB-Rejoin, and proposes S3-Net, a practical Self-Supervised Splicing Network that rejoins the OB fragments based on the shape similarity of their borderlines.

Representation Learning via Global Temporal Alignment and Cycle-Consistency

A weakly supervised method for representation learning based on aligning temporal sequences of the same process as well as two applications of the temporal alignment framework, namely 3D pose reconstruction and fine-grained audio/visual retrieval.

Soft-DTW: a Differentiable Loss Function for Time-Series

This work takes advantage of a smoothed formulation of DTW, called soft-DTW, that computes the soft-minimum of all alignment costs, and shows that this regularization is particularly well suited to average and cluster time series under the DTW geometry.
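The soft-minimum that Soft-DTW substitutes for the hard `min` can be written in closed form as `softmin_γ(a) = -γ log Σ_i exp(-a_i / γ)`, which recovers the ordinary minimum as γ → 0. A minimal sketch under that formulation (function names and the squared-distance cost are illustrative):

```python
import numpy as np

def softmin(args, gamma):
    # soft-minimum: -gamma * log(sum(exp(-a / gamma))),
    # computed with the usual max-shift for numerical stability
    a = -np.asarray(args, dtype=float) / gamma
    mx = a.max()
    return -gamma * (mx + np.log(np.exp(a - mx).sum()))

def soft_dtw(x, y, gamma=0.1):
    """Soft-DTW value: the DTW recursion with min replaced by softmin,
    making the alignment cost differentiable in the input sequences."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (x[i - 1] - y[j - 1]) ** 2
            D[i, j] = cost + softmin(
                [D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]], gamma
            )
    return D[n, m]
```

For small γ the value approaches the hard DTW cost; larger γ smooths over near-optimal alignments, which is what makes averaging and clustering under the DTW geometry well behaved.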

D3TW: Discriminative Differentiable Dynamic Time Warping for Weakly Supervised Action Alignment and Segmentation

The proposed Discriminative Differentiable Dynamic Time Warping (D3TW) innovatively solves sequence alignment with discriminative modeling and end-to-end training, which substantially improves the performance in weakly supervised action alignment and segmentation tasks.

Deep Canonical Time Warping

The Deep Canonical Time Warping (DCTW), a method which automatically learns complex non-linear representations of multiple time-series, generated such that they are highly correlated, and temporally in alignment, is presented.

Slow and Steady Feature Analysis: Higher Order Temporal Coherence in Video

A convolutional neural network is trained with a regularizer on tuples of sequential frames from unlabeled video to generalize slow feature analysis to "steady" feature analysis and impose a prior that higher order derivatives in the learned feature space must be small.

Shuffle and Learn: Unsupervised Learning Using Temporal Order Verification

This paper presents an approach for learning a visual representation from the raw spatiotemporal signals in videos using a Convolutional Neural Network, and shows that this method captures information that is temporally varying, such as human pose.

Canonical Time Warping for Alignment of Human Behavior

Alignment of time series is an important problem to solve in many scientific disciplines. In particular, temporal alignment of two or more subjects performing similar activities is a challenging problem.

Temporal Cycle-Consistency Learning

It is shown that the learned embeddings enable few-shot classification of these action phases, significantly reducing the supervised training requirements; and TCC is complementary to other methods of self-supervised learning in videos, such as Shuffle and Learn and Time-Contrastive Networks.

Few-Shot Video Classification via Temporal Alignment

This paper proposes the Ordered Temporal Alignment Module (OTAM), a novel few-shot learning framework that can learn to classify a previously unseen video and demonstrates that the model leads to significant improvement of few-shot video classification over a wide range of competitive baselines and outperforms state-of-the-art benchmarks by a large margin.

Unsupervised Representation Learning by Sorting Sequences

The experimental results show that the unsupervised representation learning approach using videos without semantic labels compares favorably against state-of-the-art methods on action recognition, image classification, and object detection tasks.