Video Test-Time Adaptation for Action Recognition
@article{Lin2022VideoTA,
  title   = {Video Test-Time Adaptation for Action Recognition},
  author  = {Wei Lin and Muhammad Jehanzeb Mirza and Mateusz Koziński and Horst Possegger and Hilde Kuehne and Horst Bischof},
  journal = {ArXiv},
  year    = {2022},
  volume  = {abs/2211.15393}
}
Although action recognition systems achieve top performance when evaluated on in-distribution test points, they are vulnerable to unanticipated distribution shifts in test data. However, test-time adaptation of video action recognition models against common distribution shifts has so far not been demonstrated. We propose to address this problem with an approach tailored to spatio-temporal models that is capable of adapting on a single video sample at each step. It consists of a feature…
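One common way to realize per-sample adaptation of this kind is to align feature statistics computed on test data with statistics stored from training. Below is a minimal PyTorch sketch of that general pattern; the function name alignment_step, the L1 distance, and the momentum value are illustrative assumptions rather than the paper's exact recipe.

import torch
import torch.nn.functional as F

def alignment_step(feats, train_mean, train_var, ema_mean, ema_var, momentum=0.1):
    # Statistics of the current test video's features (one sample per step).
    mean = feats.mean(dim=0)
    var = feats.var(dim=0, unbiased=False)
    # Online estimates of the test distribution, updated with momentum.
    ema_mean = (1 - momentum) * ema_mean + momentum * mean
    ema_var = (1 - momentum) * ema_var + momentum * var
    # Pull the online test statistics towards the stored training statistics.
    loss = F.l1_loss(ema_mean, train_mean) + F.l1_loss(ema_var, train_var)
    return loss, ema_mean.detach(), ema_var.detach()

In an online loop, the returned loss would be backpropagated through the feature extractor for each incoming video, and the detached running estimates carried over to the next step.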
References
Showing 1-10 of 56 references
Temporal Segment Networks: Towards Good Practices for Deep Action Recognition
- ECCV 2016
Deep convolutional networks have achieved great success for visual recognition in still images. However, for action recognition in videos, the advantage over traditional methods is not so evident…
Self-supervised Test-time Adaptation on Video Data
- WACV 2022
This paper explores whether recent progress in test-time adaptation in the image domain and in self-supervised learning can be leveraged to adapt a model to previously unseen and unlabelled videos presenting both mild (but arbitrary) and severe covariate shifts.
Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset
- CVPR 2017
A new Two-Stream Inflated 3D ConvNet (I3D) based on 2D ConvNet inflation is introduced; after pre-training on Kinetics, I3D models considerably improve upon the state of the art in action classification, reaching 80.2% on HMDB-51 and 97.9% on UCF-101.
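The inflation mentioned above bootstraps 3D filters from pretrained 2D ones: each k×k kernel is repeated t times along a new temporal axis and rescaled by 1/t, so the inflated network initially reproduces the 2D network's activations on a "boring" video of t identical frames. A minimal sketch (the helper name is illustrative):

import torch

def inflate_conv2d_weight(w2d: torch.Tensor, t: int) -> torch.Tensor:
    # w2d: (out_channels, in_channels, k, k) pretrained 2D kernel.
    # Repeat along a new temporal axis and rescale by 1/t so the 3D filter
    # gives the same response as the 2D one on t identical stacked frames.
    return w2d.unsqueeze(2).repeat(1, 1, t, 1, 1) / t

# Example (illustrative module names): inflate a ResNet stem kernel.
# w3d = inflate_conv2d_weight(resnet.conv1.weight.data, t=7)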
Large-scale Robustness Analysis of Video Action Recognition Models
- ArXiv 2022
A large-scale robustness analysis of existing video action recognition models, covering convolutional neural network based architectures alongside recent transformer based approaches, reveals that transformer based models are consistently more robust than CNN based models against most of the perturbations.
Is Space-Time Attention All You Need for Video Understanding?
- ICML 2021
This paper presents a convolution-free approach to video classification built exclusively on self-attention over space and time, and suggests that “divided attention,” where temporal attention and spatial attention are separately applied within each block, leads to the best video classification accuracy among the design choices considered.
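A rough sketch of such a divided space-time attention block (pre-norm residual style is assumed; the MLP sub-layer and class token are omitted for brevity):

import torch
import torch.nn as nn

class DividedSpaceTimeBlock(nn.Module):
    """Temporal attention over frames at each spatial location, then
    spatial attention over locations within each frame."""

    def __init__(self, dim: int, heads: int):
        super().__init__()
        self.norm_t = nn.LayerNorm(dim)
        self.norm_s = nn.LayerNorm(dim)
        self.attn_t = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn_s = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, f, p, d = x.shape                   # (batch, frames, patches, dim)
        # Temporal attention: each patch location attends across the f frames.
        xt = x.permute(0, 2, 1, 3).reshape(b * p, f, d)
        h = self.norm_t(xt)
        xt = xt + self.attn_t(h, h, h, need_weights=False)[0]
        x = xt.reshape(b, p, f, d).permute(0, 2, 1, 3)
        # Spatial attention: each frame's p patches attend to one another.
        xs = x.reshape(b * f, p, d)
        h = self.norm_s(xs)
        xs = xs + self.attn_s(h, h, h, need_weights=False)[0]
        return xs.reshape(b, f, p, d)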
Video Transformer Network
- ICCVW 2021
Inspired by recent developments in vision transformers, VTN, a transformer-based framework for video recognition, is presented; it enables whole-video analysis via a single end-to-end pass while requiring 1.5× fewer GFLOPs.
Recurring the Transformer for Video Action Recognition
- CVPR 2022
A novel Recurrent Vision Transformer framework based on spatial-temporal representation learning is proposed for video action recognition; it is equipped with an attention gate that builds interaction between the current frame input and the previous hidden state, temporally aggregating global inter-frame features through the hidden state.
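A loose sketch of the kind of attention-gated recurrence described above; the gating and blending scheme here is an illustrative guess, not the paper's exact architecture:

import torch
import torch.nn as nn

class AttentionGatedRecurrence(nn.Module):
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, T, tokens, dim); the hidden state carries the
        # aggregated inter-frame features across time.
        hidden = frames[:, 0]
        for t in range(1, frames.shape[1]):
            x = frames[:, t]                            # current frame tokens
            ctx = self.cross(x, hidden, hidden, need_weights=False)[0]
            g = self.gate(torch.cat([x, ctx], dim=-1))  # attention gate
            hidden = g * ctx + (1 - g) * hidden         # temporal aggregation
        return hidden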
Parameter-free Online Test-time Adaptation
- CVPR 2022
This paper investigates how test-time adaptation methods fare for a number of pre-trained models on a variety of real-world scenarios, and proposes a particularly “conservative” approach, which addresses the problem with a Laplacian Adjusted Maximum-likelihood Estimation (LAME) objective.
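A simplified sketch of a LAME-style correction; the kNN affinity construction and iteration count here are illustrative, and the paper derives the exact update via a concave-convex procedure. Note that no model parameter is updated; only the batch's output assignments are refined:

import torch
import torch.nn.functional as F

def lame_correction(probs, feats, k=5, n_iter=10):
    # probs: (N, C) softmax outputs for a batch; feats: (N, D) its features.
    f = F.normalize(feats, dim=1)
    sim = f @ f.t()                               # cosine similarities
    # k-nearest-neighbour affinity graph W (self-similarity dropped).
    topk = sim.topk(k + 1, dim=1).indices[:, 1:]
    w = torch.zeros_like(sim).scatter_(1, topk, 1.0)
    w = (w + w.t()) / 2                           # symmetrize
    z = probs.clone()
    for _ in range(n_iter):                       # fixed-point iterations
        z = torch.softmax(probs.clamp_min(1e-8).log() + w @ z, dim=1)
    return z                                      # refined assignments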
Robustifying Vision Transformer without Retraining from Scratch by Test-Time Class-Conditional Feature Alignment
- IJCAI 2022
A new test-time adaptation method called class-conditional feature alignment (CFA) is proposed, which minimizes both the class-conditional distribution differences and the whole distribution differences of the hidden representations between source and target in an online manner.
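An illustrative sketch of the class-conditional alignment idea, simplified to matching first moments under hard pseudo-labels; CFA's actual objective also matches higher-order statistics of the hidden representations:

import torch

def cfa_style_loss(feats, logits, src_mean, src_cls_mean):
    # feats: (N, D) hidden representations; logits: (N, C) predictions.
    # src_mean: (D,) and src_cls_mean: (C, D) are statistics precomputed
    # on the source (training) data.
    loss = (feats.mean(dim=0) - src_mean).norm()     # whole-distribution term
    pseudo = logits.argmax(dim=1)                     # hard pseudo-labels
    for c in pseudo.unique():
        cls_mean = feats[pseudo == c].mean(dim=0)     # target class mean
        loss = loss + (cls_mean - src_cls_mean[c]).norm()
    return loss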
DirecFormer: A Directed Attention in Transformer Approach to Robust Action Recognition
- CVPR 2022
This work presents a novel end-to-end Transformer-based Directed Attention (DirecFormer) framework that consistently achieves state-of-the-art (SOTA) results compared with recent action recognition methods.