Exploiting Instance-based Mixed Sampling via Auxiliary Source Domain Supervision for Domain-adaptive Action Detection

Yifan Lu, Gurkirt Singh, Suman Saha, Luc Van Gool

We propose a novel domain-adaptive action detection approach and a new adaptation protocol that leverages recent advances in image-level unsupervised domain adaptation (UDA) techniques and handles the vagaries of instance-level video data. Self-training combined with cross-domain mixed sampling has shown remarkable performance gains for semantic segmentation in the UDA setting. Motivated by this, we propose an approach for human action detection in videos…



Unsupervised Domain Adaptation for Spatio-Temporal Action Localization

This work proposes an end-to-end unsupervised domain adaptation algorithm that extends a state-of-the-art object detection framework to localize and classify actions, and shows that significant performance gains can be achieved when spatial and temporal features are adapted separately, or jointly for the best results.

DACS: Domain Adaptation via Cross-domain Mixed Sampling

DACS mixes images from the two domains along with the corresponding labels and pseudo-labels, and achieves state-of-the-art results on GTA5 to Cityscapes, a common synthetic-to-real semantic segmentation benchmark for UDA.
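The cross-domain mixing idea can be sketched as follows. This is a minimal illustrative example, not the paper's implementation; the function name `dacs_classmix` and its exact sampling policy are assumptions. It pastes the pixels of a random half of the source classes onto a target image and mixes the label maps the same way:

```python
import numpy as np

def dacs_classmix(src_img, src_lbl, tgt_img, tgt_pseudo, rng=None):
    """Sketch of DACS-style ClassMix: paste pixels of half of the source
    classes onto a target image, mixing ground-truth labels with
    pseudo-labels accordingly. Hypothetical helper, for illustration only."""
    rng = np.random.default_rng(rng)
    classes = np.unique(src_lbl)
    # Randomly select half of the classes present in the source label map.
    chosen = rng.choice(classes, size=max(1, len(classes) // 2), replace=False)
    mask = np.isin(src_lbl, chosen)                          # H x W binary mask
    mixed_img = np.where(mask[..., None], src_img, tgt_img)  # H x W x C image
    mixed_lbl = np.where(mask, src_lbl, tgt_pseudo)          # labels / pseudo-labels
    return mixed_img, mixed_lbl
```

Each mixed pixel thus carries either a trusted source label or a target pseudo-label, which is what lets self-training proceed on the mixed sample.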

Context-Aware Mixup for Domain Adaptive Semantic Segmentation

A novel Context-Aware Mixup (CAMix) framework for domain-adaptive semantic segmentation, which exploits context dependency as explicit prior knowledge in a fully end-to-end trainable manner to enhance adaptability toward the target domain.

SlowFast Networks for Video Recognition

This work presents SlowFast networks for video recognition, which achieve strong performance for both action classification and detection in video; the large improvements are pinpointed as contributions of the SlowFast concept.

TubeR: Tubelet Transformer for Video Action Detection

TubeR is a simple solution to directly detect an action tubelet in a video by simultaneously performing action localization and recognition from a single representation and outperforms the previous state-of-the-art on commonly used action detection datasets AVA, UCF101-24 and JHMDB51-21.

AVA: A Video Dataset of Spatio-Temporally Localized Atomic Visual Actions

The AVA dataset densely annotates 80 atomic visual actions in 437 15-minute video clips, where actions are localized in space and time, resulting in 1.59M action labels with multiple labels per person occurring frequently.

Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results

The recently proposed Temporal Ensembling achieves state-of-the-art results on several semi-supervised learning benchmarks but becomes unwieldy on large datasets, so Mean Teacher, a method that averages model weights instead of label predictions, is proposed.
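The weight averaging at the heart of Mean Teacher is a simple exponential moving average (EMA) of the student's parameters; a minimal sketch (the function name `ema_update` and dict-of-arrays parameter format are assumptions):

```python
def ema_update(teacher_params, student_params, alpha=0.99):
    """One Mean Teacher EMA step:
    theta_teacher <- alpha * theta_teacher + (1 - alpha) * theta_student.
    Parameters are given as {name: value} dicts; illustrative sketch only."""
    return {name: alpha * teacher_params[name]
                  + (1.0 - alpha) * student_params[name]
            for name in teacher_params}
```

The teacher is never trained by gradient descent; it is updated this way after every student optimizer step, and its predictions serve as consistency targets.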

Unsupervised Domain Adaptation by Backpropagation

The method, built around a gradient reversal layer trained with standard backpropagation, performs very well in a series of image classification experiments, achieving a strong adaptation effect in the presence of large domain shifts and outperforming the previous state of the art on the Office datasets.

Interact before Align: Leveraging Cross-Modal Knowledge for Domain Adaptive Action Recognition

A novel model is presented that jointly considers cross-modal knowledge interaction and cross-modal complementarity for domain-adaptive action recognition, and significantly outperforms state-of-the-art methods on multiple benchmark datasets, including the complex fine-grained dataset EPIC-Kitchens-100.

End-to-End Semi-Supervised Learning for Video Action Detection

  • Akash Kumar and Y. Rawat, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022
This work proposes a simple end-to-end consistency-based approach that effectively utilizes unlabeled data for video action detection, and introduces two novel regularization constraints for spatio-temporal consistency: temporal coherency and gradient smoothness.
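The flavor of such consistency regularizers can be sketched with two toy losses. These are illustrative stand-ins, not the paper's actual formulations; the function names and the mean-squared-error form are assumptions:

```python
import numpy as np

def consistency_loss(pred_a, pred_b):
    """Toy consistency objective: mean squared difference between the
    model's predictions for two augmented views of the same clip."""
    return float(np.mean((pred_a - pred_b) ** 2))

def temporal_coherency(preds):
    """Toy temporal coherency penalty on per-frame predictions of shape
    (T, ...): discourage abrupt changes between consecutive frames."""
    return float(np.mean((preds[1:] - preds[:-1]) ** 2))
```

Both terms are zero when predictions agree across views or vary smoothly over time, which is the behavior the unlabeled data is used to encourage.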