Crossmodal Representation Learning for Zero-shot Action Recognition

  • Chung-Ching Lin, Kevin Lin, Linjie Li, Lijuan Wang, Zicheng Liu
  • 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
We present a cross-modal Transformer-based framework, which jointly encodes video data and text labels for zero-shot action recognition (ZSAR). Our model employs a conceptually new pipeline by which visual representations are learned in conjunction with visual-semantic associations in an end-to-end manner. The model design provides a natural mechanism for visual and semantic representations to be learned in a shared knowledge space, whereby it encourages the learned visual embedding to be… 

Temporal and cross-modal attention for audio-visual zero-shot learning

This work proposes a multi-modal and Temporal Cross-attention Framework (TCaF) for audio-visual generalised zero-shot learning and shows that ingesting temporal features yields state-of-the-art performance on the UCF-GZSLcls, VGGSound-GZSL, and ActivityNet-GZSL benchmarks for (generalised) zero-shot learning.

REST: REtrieve & Self-Train for generative action recognition

REST, a training framework consisting of an unsupervised method for adapting the generative model to the action/video domain by means of pseudo-caption generation and self-training, is introduced; both components are shown to be necessary to obtain high accuracy.

Multimodal Open-Vocabulary Video Classification via Pre-Trained Vision and Language Models

In MOV, the vision encoder from pre-trained VLMs is used with minimal modifications to directly encode video, optical flow, and audio spectrograms, and a cross-modal fusion mechanism is designed to aggregate complementary multimodal information.

Human Action Recognition from Various Data Modalities: A Review

This paper presents a comprehensive survey of recent progress in deep learning methods for HAR based on the type of input data modality, including the fusion-based and the co-learning-based frameworks.

Transforming CLIP to an Open-vocabulary Video Model via Interpolated Weight Optimization

Open-VCLIP is introduced, a simple yet effective approach that transforms CLIP into a strong zero-shot video classifier that can recognize unseen actions and events at test time; Interpolated Weight Optimization is proposed, which leverages the benefit of weight interpolation at both training and test time.
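
The core interpolation step can be sketched in a few lines. This is a minimal illustration, not the paper's actual implementation; the parameter names and toy weights are invented for the example.

```python
# Hedged sketch of weight interpolation: blend a pretrained model's weights
# with fine-tuned weights, tensor by tensor.
import numpy as np

def interpolate_weights(pretrained, finetuned, alpha):
    """Return theta = (1 - alpha) * pretrained + alpha * finetuned, per tensor."""
    return {name: (1.0 - alpha) * pretrained[name] + alpha * finetuned[name]
            for name in pretrained}

pretrained = {"proj": np.array([1.0, 0.0])}  # toy stand-in for CLIP weights
finetuned = {"proj": np.array([0.0, 1.0])}   # toy stand-in for video-tuned weights
blended = interpolate_weights(pretrained, finetuned, alpha=0.5)
print(blended["proj"])  # -> [0.5 0.5]
```

Intuitively, alpha trades off video-specific accuracy (fine-tuned weights) against the generalization of the original CLIP weights.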

Vision Transformers for Action Recognition: A Survey

This work provides the first comprehensive survey of vision transformer techniques for action recognition and investigates different network learning strategies, such as self-supervised and zero-shot learning, along with their associated losses for transformer-based action recognition.

Semantic embedding space for zero-shot action recognition

This paper addresses zero-shot recognition in contemporary video action recognition tasks, using semantic word vector space as the common space to embed videos and category labels, and demonstrates that a simple self-training and data augmentation strategy can significantly improve the efficacy of this mapping.
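
The embedding-and-match idea can be sketched as follows. This is an illustrative toy, assuming a pre-learned linear visual-to-semantic map; the matrix, class word vectors, and function names are all invented for the example.

```python
# Sketch of zero-shot classification in a semantic word-vector space:
# project the video feature into the label embedding space, then pick the
# class whose word vector is most similar.
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

W = np.eye(3)  # toy stand-in for a regression-learned visual-to-semantic map
class_vectors = {"run": np.array([1.0, 0.0, 0.0]),
                 "swim": np.array([0.0, 1.0, 0.0])}

def predict(video_feature):
    z = W @ video_feature  # map into word-vector space
    return max(class_vectors, key=lambda c: cosine(z, class_vectors[c]))

print(predict(np.array([0.9, 0.1, 0.0])))  # -> run
```

Because the class vectors come from a word-embedding model, unseen categories can be added by inserting their word vectors, with no visual training data.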

Zero-Shot Visual Recognition via Bidirectional Latent Embedding

A stagewise bidirectional latent embedding framework with two successive learning stages is proposed for zero-shot visual recognition; comparative experiments demonstrate state-of-the-art performance under both inductive and transductive settings.

Multi-Task Zero-Shot Action Recognition with Prioritised Data Augmentation

A visual-semantic mapping with better generalisation properties and a dynamic data re-weighting method that prioritises auxiliary data relevant to the target classes are introduced and applied to the challenging zero-shot action recognition problem.

Learning a Deep Embedding Model for Zero-Shot Learning

  • Li Zhang, T. Xiang, S. Gong
  • Computer Science
    2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
  • 2017
This paper proposes to use the visual space as the embedding space instead of embedding into a semantic space or an intermediate space, and argues that in this space, the subsequent nearest neighbour search would suffer much less from the hubness problem and thus become more effective.
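
The reversed embedding direction can be sketched as below: class semantic vectors are mapped into the visual feature space and nearest-neighbour search runs there, which the paper argues suffers less from hubness. The mapping matrix, class vectors, and names are illustrative stand-ins, not the paper's model.

```python
# Sketch: project class semantics INTO visual space, then do nearest-neighbour
# search among the resulting class prototypes.
import numpy as np

M = np.array([[2.0, 0.0],
              [0.0, 2.0]])  # toy stand-in for a learned semantic-to-visual map
semantic = {"jump": np.array([1.0, 0.0]), "clap": np.array([0.0, 1.0])}
prototypes = {c: M @ v for c, v in semantic.items()}  # prototypes in visual space

def classify(visual_feature):
    return min(prototypes, key=lambda c: np.linalg.norm(visual_feature - prototypes[c]))

print(classify(np.array([1.8, 0.2])))  # -> jump
```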

Transductive Zero-Shot Action Recognition by Word-Vector Embedding

This study constructs a mapping between visual features and a semantic descriptor of each action category, allowing new categories to be recognised in the absence of any visual training data. It achieves state-of-the-art zero-shot action recognition performance with a simple and efficient pipeline and without supervised annotation of attributes.

Semantic Manifold Alignment in Visual Feature Space for Zero-Shot Learning

A novel strategy based on Aligning Semantic Manifolds in Feature Space (ASMFS) is proposed to boost the performance of ZSL by adjusting the predicted unseen semantic representations toward the average of their K nearest neighbours (K-NN).
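
The K-NN smoothing step can be sketched as follows. This is a toy rendition of the adjustment idea only; the data, the choice of K, and the decision to average over neighbours drawn from the predictions themselves are assumptions for illustration.

```python
# Sketch: replace each predicted semantic vector with the mean of its K nearest
# neighbours among all predictions, smoothing outliers toward the manifold.
import numpy as np

def knn_adjust(preds, k):
    preds = np.asarray(preds, dtype=float)
    out = np.empty_like(preds)
    for i, p in enumerate(preds):
        dists = np.linalg.norm(preds - p, axis=1)
        nn = np.argsort(dists)[:k]  # includes the point itself (distance 0)
        out[i] = preds[nn].mean(axis=0)
    return out

preds = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]]
adjusted = knn_adjust(preds, k=2)
print(adjusted[0])  # -> [0.05 0.  ]
```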

Visual Data Synthesis via GAN for Zero-Shot Video Classification

A visual data synthesis framework via GAN is proposed to address the information degradation issue; it captures seen-to-unseen correlation in matched and mismatched visual-semantic pairs via mutual information, providing the zero-shot synthesis procedure with robust guidance signals.

Alternative Semantic Representations for Zero-Shot Human Action Recognition

This paper explores two alternative semantic representations especially for zero-shot human action recognition: textual descriptions of human actions and deep features extracted from still images relevant to human actions.

Towards a Fair Evaluation of Zero-Shot Action Recognition Using External Data

This work empirically shows that external sources tend to contain actions excessively similar to the target classes, strongly influencing performance and violating the zero-shot premise, and proposes a corrective method to automatically filter out overly similar categories by exploiting the pairwise intra-dataset similarity of the labels.
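
The filtering idea can be sketched as a similarity threshold over label embeddings. The embeddings, threshold value, and function name here are illustrative assumptions, not the paper's exact procedure.

```python
# Sketch: drop external training categories whose label embedding is too
# similar to any target (test) class label.
import numpy as np

def filter_external(external, targets, tau):
    keep = {}
    for name, vec in external.items():
        sims = [float(vec @ t / (np.linalg.norm(vec) * np.linalg.norm(t)))
                for t in targets.values()]
        if max(sims) < tau:  # keep only categories below the similarity cap
            keep[name] = vec
    return keep

targets = {"basketball": np.array([1.0, 0.0])}
external = {"basketball dunk": np.array([0.95, 0.05]),
            "cooking": np.array([0.0, 1.0])}
print(sorted(filter_external(external, targets, tau=0.9)))  # -> ['cooking']
```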

Elaborative Rehearsal for Zero-shot Action Recognition

  • Shizhe Chen, Dong Huang
  • Computer Science
    2021 IEEE/CVF International Conference on Computer Vision (ICCV)
  • 2021
This work proposes an ER-enhanced ZSAR model inspired by Elaborative Rehearsal (ER), an effective human memory technique that involves elaborating a new concept and relating it to known concepts; the model achieves state-of-the-art results on three existing benchmarks.