Context-Aware Sequence Alignment using 4D Skeletal Augmentation

@article{Kwon2022ContextAwareSA,
  title={Context-Aware Sequence Alignment using 4D Skeletal Augmentation},
  author={Taein Kwon and Bugra Tekin and Siyu Tang and Marc Pollefeys},
  journal={ArXiv},
  year={2022},
  volume={abs/2204.12223}
}
Temporal alignment of fine-grained human actions in videos is important for numerous applications in computer vision, robotics, and mixed reality. State-of-the-art methods directly learn an image-based embedding space by leveraging powerful deep convolutional neural networks. While straightforward, their results are far from satisfactory: the aligned videos exhibit severe temporal discontinuity without additional post-processing steps. The recent advancements in human body and hand pose… 

References

Self-Supervised Learning of Pose Embeddings from Spatiotemporal Relations in Videos
TLDR
The key idea is to combine temporal ordering and spatial placement estimation as auxiliary tasks for learning pose similarities in a Siamese convolutional network for self-supervised learning of pose embeddings.
Self-supervised 3D Skeleton Action Representation Learning with Motion Consistency and Continuity
TLDR
This work proposes a novel SSL method for learning 3D skeleton representations effectively: it constructs a positive clip and a negative clip from the sampled action sequence, pulling positive pairs closer while pushing negative pairs apart, which forces the network to learn the intrinsic dynamic motion consistency information.
Learning by Aligning Videos in Time
We present a self-supervised approach for learning video representations using temporal video alignment as a pretext task, while exploiting both frame-level and video-level information. We leverage a
Representation Learning via Global Temporal Alignment and Cycle-Consistency
TLDR
A weakly supervised method for representation learning based on aligning temporal sequences of the same process as well as two applications of the temporal alignment framework, namely 3D pose reconstruction and fine-grained audio/visual retrieval.
The IKEA ASM Dataset: Understanding People Assembling Furniture through Actions, Objects and Pose
TLDR
IKEA ASM is introduced: a three-million-frame, multi-view furniture assembly video dataset that includes depth, atomic actions, object segmentation, and human poses, and that enables the development of holistic methods which integrate multi-modal and multi-view data to perform better on these tasks.
DynamoNet: Dynamic Action and Motion Network
TLDR
A novel unified spatio-temporal 3D-CNN architecture (DynamoNet) that jointly optimizes the video classification and learning motion representation by predicting future frames as a multi-task learning problem is introduced.
Spatiotemporal Contrastive Video Representation Learning
TLDR
This work proposes a temporally consistent spatial augmentation method to impose strong spatial augmentations on each frame of the video while maintaining temporal consistency across frames, and a sampling-based temporal augmentation method to avoid overly enforcing invariance on clips that are distant in time.
Human Motion Analysis with Deep Metric Learning
TLDR
A novel metric learning objective based on a triplet architecture and Maximum Mean Discrepancy is proposed and a novel deep architecture based on attentive recurrent neural networks is proposed, which enforces a better separation within the learned embedding space of the different motion categories by means of the associated distribution moments.
Self-Supervised Spatiotemporal Learning via Video Clip Order Prediction
TLDR
A self-supervised spatiotemporal learning technique that leverages the chronological order of videos to learn a spatiotemporal representation of the video by predicting the order of shuffled clips from it.
Temporal Cycle-Consistency Learning
TLDR
It is shown that the learned embeddings enable few-shot classification of these action phases, significantly reducing the supervised training requirements; and TCC is complementary to other methods of self-supervised learning in videos, such as Shuffle and Learn and Time-Contrastive Networks.