Video Test-Time Adaptation for Action Recognition

Wei Lin, Muhammad Jehanzeb Mirza, Mateusz Koziński, Horst Possegger, Hilde Kuehne, Horst Bischof

Although action recognition systems achieve top performance when evaluated on in-distribution test points, they are vulnerable to unanticipated distribution shifts in test data. However, test-time adaptation of video action recognition models against common distribution shifts has so far not been demonstrated. We propose to address this problem with an approach tailored to spatio-temporal models, capable of adapting on a single video sample at a time. It consists in a feature…
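The abstract is truncated before the method details, but the single-sample adaptation it alludes to can be illustrated with a generic feature-statistic alignment sketch: compare the mean and variance of a test clip's features against statistics stored from the source domain. All names (`feature_stats`, `alignment_loss`, `source_mean`) are illustrative assumptions, not this paper's API.

```python
def feature_stats(features):
    """Per-dimension mean and variance over a list of feature vectors."""
    n, dim = len(features), len(features[0])
    mean = [sum(f[d] for f in features) / n for d in range(dim)]
    var = [sum((f[d] - mean[d]) ** 2 for f in features) / n for d in range(dim)]
    return mean, var

def alignment_loss(target_features, source_mean, source_var):
    """L1 discrepancy between target-clip and stored source statistics.
    In a real adaptation loop this scalar would be minimized by gradient
    steps on the network parameters, one video at a time."""
    t_mean, t_var = feature_stats(target_features)
    return (sum(abs(a - b) for a, b in zip(t_mean, source_mean))
            + sum(abs(a - b) for a, b in zip(t_var, source_var)))

# Toy clip: two 2-D frame features whose statistics match the source exactly,
# so the loss is zero and no adaptation signal is produced.
clip = [[1.0, 2.0], [3.0, 2.0]]
loss = alignment_loss(clip, source_mean=[2.0, 2.0], source_var=[1.0, 0.0])
```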

Temporal Segment Networks: Towards Good Practices for Deep Action Recognition

Deep convolutional networks have achieved great success for visual recognition in still images. However, for action recognition in videos, the advantage over traditional methods is not so evident.

Self-supervised Test-time Adaptation on Video Data

This paper explores whether the recent progress in test-time adaptation in the image domain and self-supervised learning can be leveraged to adapt a model to previously unseen and unlabelled videos presenting both mild (but arbitrary) and severe covariate shifts.

Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset

A new Two-Stream Inflated 3D ConvNet (I3D), based on 2D ConvNet inflation, is introduced; I3D models considerably improve upon the state of the art in action classification, reaching 80.2% on HMDB-51 and 97.9% on UCF-101 after pre-training on Kinetics.

Large-scale Robustness Analysis of Video Action Recognition Models

A large-scale robustness analysis of existing video action recognition models, covering both convolutional neural networks and recent transformer-based approaches, which reveals that transformer-based models are consistently more robust than CNN-based models against most perturbations.

Is Space-Time Attention All You Need for Video Understanding?

This paper presents a convolution-free approach to video classification built exclusively on self-attention over space and time, and suggests that “divided attention,” where temporal attention and spatial attention are separately applied within each block, leads to the best video classification accuracy among the design choices considered.

Video Transformer Network

Inspired by recent developments in vision transformers, VTN is presented, a transformer-based framework for video recognition that enables whole video analysis, via a single end-to-end pass, while requiring 1.5× fewer GFLOPs.

Recurring the Transformer for Video Action Recognition

A novel Recurrent Vision Transformer framework based on spatial-temporal representation learning for video action recognition. It is equipped with an attention gate that builds interaction between the current frame input and the previous hidden state, thereby aggregating global-level inter-frame features through the hidden state over time.

Parameter-free Online Test-time Adaptation

This paper investigates how test-time adaptation methods fare for a number of pre-trained models on a variety of real-world scenarios, and proposes a particularly “conservative” approach, which addresses the problem with a Laplacian Adjusted Maximum-likelihood Estimation (LAME) objective.

Robustifying Vision Transformer without Retraining from Scratch by Test-Time Class-Conditional Feature Alignment

A new test-time adaptation method called class-conditional feature alignment (CFA) is proposed, which minimizes both the class-conditional distribution differences and the whole-distribution differences of the hidden representations between source and target in an online manner.
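The two alignment terms in the abstract can be sketched with per-class and overall feature means. This is a simplified mean-only sketch under assumed names (`cfa_loss`, pseudo-labels standing in for target classes), not CFA's actual objective, which also involves higher-order statistics.

```python
def mean(vectors):
    """Per-dimension mean of a list of equal-length vectors."""
    n, dim = len(vectors), len(vectors[0])
    return [sum(v[d] for v in vectors) / n for d in range(dim)]

def cfa_loss(features, pseudo_labels, source_class_means, source_mean):
    """Sum of squared distances between target and stored source means,
    both per (pseudo-labelled) class and over the whole distribution."""
    loss = sum((a - b) ** 2 for a, b in zip(mean(features), source_mean))
    for c, src in source_class_means.items():
        cls_feats = [f for f, y in zip(features, pseudo_labels) if y == c]
        if cls_feats:
            loss += sum((a - b) ** 2 for a, b in zip(mean(cls_feats), src))
    return loss
```

When target features per class already match the stored source means, both terms vanish and the model needs no update for that batch.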

DirecFormer: A Directed Attention in Transformer Approach to Robust Action Recognition

This work presents a novel end-to-end Transformer-based Directed Attention (DirecFormer) framework that consistently achieves state-of-the-art (SOTA) results compared with recent action recognition methods.