Crossmodal Representation Learning for Zero-shot Action Recognition
@inproceedings{Lin2022CrossmodalRL,
  title={Crossmodal Representation Learning for Zero-shot Action Recognition},
  author={Chung-Ching Lin and Kevin Lin and Linjie Li and Lijuan Wang and Zicheng Liu},
  booktitle={2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2022},
  pages={19946-19956}
}
We present a cross-modal Transformer-based framework, which jointly encodes video data and text labels for zero-shot action recognition (ZSAR). Our model employs a conceptually new pipeline by which visual representations are learned in conjunction with visual-semantic associations in an end-to-end manner. The model design provides a natural mechanism for visual and semantic representations to be learned in a shared knowledge space, whereby it encourages the learned visual embedding to be…
6 Citations
Temporal and cross-modal attention for audio-visual zero-shot learning
- Computer Science, ECCV
- 2022
This work proposes a multi-modal and Temporal Cross-attention Framework (TCaF) for audio-visual generalised zero-shot learning and shows that the proposed framework, which ingests temporal features, yields state-of-the-art performance on the UCF-GZSL, VGGSound-GZSL, and ActivityNet-GZSL benchmarks for (generalised) zero-shot learning.
REST: REtrieve & Self-Train for generative action recognition
- Computer Science, ArXiv
- 2022
REST, a training framework consisting of an unsupervised method for adapting the generative model to action/video by means of pseudo-caption generation and self-training, is introduced; both components are shown to be necessary to obtain high accuracy.
Multimodal Open-Vocabulary Video Classification via Pre-Trained Vision and Language Models
- Computer Science, ArXiv
- 2022
In MOV, the vision encoder from pre-trained VLMs is used, with minimal modifications, to directly encode video, optical flow, and audio spectrograms, and a cross-modal fusion mechanism is designed to aggregate complementary multimodal information.
Human Action Recognition from Various Data Modalities: A Review
- Computer Science, IEEE Transactions on Pattern Analysis and Machine Intelligence
- 2022
This paper presents a comprehensive survey of recent progress in deep learning methods for HAR based on the type of input data modality, including the fusion-based and the co-learning-based frameworks.
Transforming CLIP to an Open-vocabulary Video Model via Interpolated Weight Optimization
- Computer Science
- 2023
Open-VCLIP is introduced, a simple yet effective approach that transforms CLIP into a strong zero-shot video classifier that can recognize unseen actions and events at test time; Interpolated Weight Optimization is also proposed, which exploits the benefits of weight interpolation at both training and test time.
Vision Transformers for Action Recognition: A Survey
- Computer Science, ArXiv
- 2022
This paper provides the first comprehensive survey of vision transformer techniques for action recognition and investigates different network learning strategies, such as self-supervised and zero-shot learning, along with their associated losses for transformer-based action recognition.
References
Showing 1-10 of 84 references
Semantic embedding space for zero-shot action recognition
- Computer Science, 2015 IEEE International Conference on Image Processing (ICIP)
- 2015
This paper addresses zero-shot recognition in contemporary video action recognition tasks, using semantic word vector space as the common space to embed videos and category labels, and demonstrates that a simple self-training and data augmentation strategy can significantly improve the efficacy of this mapping.
Zero-Shot Visual Recognition via Bidirectional Latent Embedding
- Computer Science, International Journal of Computer Vision
- 2017
Comparative experiments demonstrate that the proposed stagewise bidirectional latent embedding framework, consisting of two subsequent learning stages for zero-shot visual recognition, yields state-of-the-art performance under both inductive and transductive settings.
Multi-Task Zero-Shot Action Recognition with Prioritised Data Augmentation
- Computer Science, ECCV
- 2016
A visual-semantic mapping with better generalisation properties and a dynamic data re-weighting method to prioritise auxiliary data that are relevant to the target classes are introduced and applied to the challenging zero-shot action recognition problem.
Learning a Deep Embedding Model for Zero-Shot Learning
- Computer Science, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
- 2017
This paper proposes to use the visual space as the embedding space instead of embedding into a semantic space or an intermediate space, and argues that in this space, the subsequent nearest neighbour search would suffer much less from the hubness problem and thus become more effective.
Transductive Zero-Shot Action Recognition by Word-Vector Embedding
- Computer Science, International Journal of Computer Vision
- 2016
This study constructs a mapping between visual features and a semantic descriptor of each action category, allowing new categories to be recognised in the absence of any visual training data, and achieves the state-of-the-art zero-shot action recognition performance with a simple and efficient pipeline, and without supervised annotation of attributes.
Semantic Manifold Alignment in Visual Feature Space for Zero-Shot Learning
- Computer Science, 2018 IEEE International Conference on Multimedia and Expo (ICME)
- 2018
A novel strategy based on Aligning Semantic Manifolds in Feature Space (ASMFS) is proposed to boost ZSL performance by adjusting the predicted unseen semantic representations toward the average of their K nearest neighbors (K-NN).
Visual Data Synthesis via GAN for Zero-Shot Video Classification
- Computer Science, IJCAI
- 2018
A visual data synthesis framework based on GANs is proposed to address the information degradation issue; it captures seen-to-unseen correlations in matched and mismatched visual-semantic pairs via mutual information, providing the zero-shot synthesis procedure with robust guidance signals.
Alternative Semantic Representations for Zero-Shot Human Action Recognition
- Computer Science, ECML/PKDD
- 2017
This paper explores two alternative semantic representations especially for zero-shot human action recognition: textual descriptions of human actions and deep features extracted from still images relevant to human actions.
Towards a Fair Evaluation of Zero-Shot Action Recognition Using External Data
- Computer Science, ECCV Workshops
- 2018
This work empirically shows that external sources tend to contain actions excessively similar to the target classes, strongly influencing performance and violating the zero-shot premise, and proposes a corrective method that automatically filters out overly similar categories by exploiting the pairwise intra-dataset similarity of the labels.
Elaborative Rehearsal for Zero-shot Action Recognition
- Computer Science, 2021 IEEE/CVF International Conference on Computer Vision (ICCV)
- 2021
This work proposes an ER-enhanced ZSAR model inspired by Elaborative Rehearsal (ER), an effective human memory technique that involves elaborating a new concept and relating it to known concepts; the model achieves state-of-the-art results on three existing benchmarks.