Syntactically Guided Generative Embeddings for Zero-Shot Skeleton Action Recognition

Pranay Gupta, Divyanshu Sharma, Ravi Kiran Sarvadevabhatla
We introduce SynSE, a novel syntactically guided generative approach to Zero-Shot Learning (ZSL). Our end-to-end approach learns progressively refined generative embedding spaces constrained within and across the involved modalities (visual, language). The inter-modal constraints are defined between the action sequence embedding and the embeddings of the Part-of-Speech (PoS) tagged words in the corresponding action description. We deploy SynSE for the task of skeleton-based action sequence recognition…
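The inter-modal constraints above hinge on per-PoS semantic embeddings of the action description. A minimal numpy sketch of that grouping step, using a toy hand-written PoS lexicon and random word vectors (both purely illustrative; SynSE uses a real PoS tagger and pretrained word embeddings):

```python
import numpy as np

# Toy PoS lexicon for a few action-description words (illustrative only).
POS = {"throw": "VERB", "ball": "NOUN", "wear": "VERB", "jacket": "NOUN"}

rng = np.random.default_rng(0)
word_vec = {w: rng.standard_normal(50) for w in POS}  # stand-in word embeddings

def pos_embeddings(description):
    """Group the description's word vectors by PoS tag and average each
    group, yielding one semantic embedding per tag (e.g. VERB, NOUN)."""
    groups = {}
    for w in description.lower().split():
        if w in POS:
            groups.setdefault(POS[w], []).append(word_vec[w])
    return {tag: np.mean(vecs, axis=0) for tag, vecs in groups.items()}

emb = pos_embeddings("throw ball")  # one embedding each for VERB and NOUN
```

Each per-tag embedding can then be aligned separately with the visual (skeleton sequence) embedding, which is the role the inter-modal constraints play in the abstract above.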


Multi-Modal Zero-Shot Sign Language Recognition
This work proposes a multi-modal Zero-Shot Sign Language Recognition (ZS-SLR) model that harnesses the complementary capabilities of deep features fused with skeleton-based ones, and uses an Auto-Encoder on top of a Long Short-Term Memory (LSTM) network.
ZS-SLR: Zero-Shot Sign Language Recognition from RGB-D Videos
This work formulates the problem of Zero-Shot Sign Language Recognition (ZS-SLR), proposes a two-stream model over two input modalities, RGB and Depth videos, and configures a transformer encoder-decoder architecture as a fast and accurate human detection model that overcomes the shortcomings of current human detection models.


Fine-Grained Action Retrieval Through Multiple Parts-of-Speech Embeddings
This paper proposes to enrich the embedding by disentangling parts of speech (PoS) in the accompanying captions, building a separate multi-modal embedding space for each PoS tag; these specialised spaces offer multiple views of the same embedded entities.
Skeleton based Zero Shot Action Recognition in Joint Pose-Language Semantic Space
This work presents a body pose based zero shot action recognition network and demonstrates how this pose-language semantic space encodes knowledge which allows the model to correctly predict actions not seen during training.
Generalized Zero-Shot Learning via Aligned Variational Autoencoders
This work proposes a model in which a shared latent space of image features and class embeddings is learned by aligned variational autoencoders, in order to generate latent features for training a softmax classifier, and establishes a new state of the art on generalized zero-shot learning.
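The alignment of the two modality-specific VAEs rests on two losses: cross-reconstruction (decode each modality's latent into the other modality) and distribution alignment between the two Gaussian posteriors. A forward-pass numpy sketch with random weights standing in for trained encoders/decoders (all dimensions and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
d_img, d_cls, d_z = 64, 16, 8

# Random projections stand in for the learned encoder/decoder weights.
enc_img = rng.standard_normal((d_img, 2 * d_z)) * 0.1
enc_cls = rng.standard_normal((d_cls, 2 * d_z)) * 0.1
dec_img = rng.standard_normal((d_z, d_img)) * 0.1
dec_cls = rng.standard_normal((d_z, d_cls)) * 0.1

def encode(x, W):
    h = x @ W
    return h[:d_z], h[d_z:]          # mean, log-variance

def reparam(mu, logvar):
    return mu + np.exp(0.5 * logvar) * rng.standard_normal(mu.shape)

x_img = rng.standard_normal(d_img)   # CNN image feature (toy)
x_cls = rng.standard_normal(d_cls)   # class embedding (toy)

mu_i, lv_i = encode(x_img, enc_img)
mu_c, lv_c = encode(x_cls, enc_cls)
z_i, z_c = reparam(mu_i, lv_i), reparam(mu_c, lv_c)

# Cross-alignment: decode each modality's latent into the *other* modality.
cross_loss = (np.mean((z_c @ dec_img - x_img) ** 2)
              + np.mean((z_i @ dec_cls - x_cls) ** 2))

# Distribution alignment: 2-Wasserstein distance between the two
# diagonal Gaussian posteriors.
dist_loss = (np.sum((mu_i - mu_c) ** 2)
             + np.sum((np.exp(0.5 * lv_i) - np.exp(0.5 * lv_c)) ** 2))
```

Minimising both terms drives image features and class embeddings toward a shared latent space, from which latent features for unseen classes can be sampled to train the softmax classifier.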
Learning Robust Visual-Semantic Embeddings
An end-to-end learning framework that extracts more robust multi-modal representations across domains, together with a novel unsupervised data-adaptation inference technique for constructing more comprehensive embeddings for both labeled and unlabeled data.
Generalized Zero-Shot Learning via Synthesized Examples
This work presents a generative framework for generalized zero-shot learning where the training and test classes are not necessarily disjoint, and can generate novel exemplars from seen/unseen classes, given their respective class attributes.
Learning to Compare: Relation Network for Few-Shot Learning
A conceptually simple, flexible, and general framework for few-shot learning, in which a classifier must learn to recognise new classes given only a few examples of each; the framework is easily extended to zero-shot learning.
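The core of the relation-network idea is a learned comparison: concatenate a query embedding with a class embedding and let a small network output a relation score. A numpy sketch with random weights standing in for the trained relation module (all sizes illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
d = 32

# Random weights stand in for the learned relation module (an MLP here).
W1 = rng.standard_normal((2 * d, 16)) * 0.1
W2 = rng.standard_normal(16) * 0.1

def relation_score(query_emb, class_emb):
    """Concatenate the two embeddings and score their 'relation' with a
    small MLP; the sigmoid output lies in (0, 1)."""
    h = np.maximum(0.0, np.concatenate([query_emb, class_emb]) @ W1)  # ReLU
    return float(1.0 / (1.0 + np.exp(-(h @ W2))))

query = rng.standard_normal(d)
protos = rng.standard_normal((5, d))      # one embedding per candidate class
scores = [relation_score(query, p) for p in protos]
pred = int(np.argmax(scores))             # predicted class index
```

The zero-shot extension mentioned in the summary swaps the class prototype (built from support examples) for a class embedding derived from semantic side information, leaving the comparison module unchanged.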
Feature Generating Networks for Zero-Shot Learning
A novel generative adversarial network (GAN) that synthesizes CNN features conditioned on class-level semantic information, offering a shortcut directly from a semantic descriptor of a class to a class-conditional feature distribution.
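The "shortcut" described here is a conditional generator mapping (noise, class attributes) to synthetic CNN features, so a classifier can be trained on fabricated examples of unseen classes. A generator-forward-pass sketch in numpy, with random weights standing in for the trained GAN (names and dimensions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
d_attr, d_noise, d_feat = 10, 8, 64

# Random weights stand in for the trained conditional generator G.
G1 = rng.standard_normal((d_attr + d_noise, 32)) * 0.1
G2 = rng.standard_normal((32, d_feat)) * 0.1

def generate_features(class_attr, n):
    """Sample n synthetic CNN features conditioned on a class-attribute
    vector: G(z, a) with z ~ N(0, I)."""
    z = rng.standard_normal((n, d_noise))
    a = np.tile(class_attr, (n, 1))
    h = np.maximum(0.0, np.concatenate([z, a], axis=1) @ G1)  # ReLU layer
    return h @ G2

unseen_attr = rng.standard_normal(d_attr)       # attributes of an unseen class
fake = generate_features(unseen_attr, 100)      # synthetic training features
```

In the paper's setting the generator is trained adversarially against real seen-class features; here only the inference-time sampling step is shown.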
View Adaptive Neural Networks for High Performance Skeleton-Based Human Action Recognition
A novel view-adaptation scheme that automatically determines the virtual observation viewpoints over the course of an action in a learning-based, data-driven manner, plus a two-stream scheme (referred to as VA-fusion) that fuses the scores of the two networks into the final prediction, obtaining enhanced performance.
Learning Deep Representations of Fine-Grained Visual Descriptions
This model achieves strong performance on zero-shot text-based image retrieval and significantly outperforms the attribute-based state of the art for zero-shot classification on the Caltech-UCSD Birds 200-2011 dataset.
Zero-Shot Learning by Convex Combination of Semantic Embeddings
A simple method is proposed for constructing an image embedding system from any existing n-way image classifier and a semantic word-embedding model that contains the n class labels in its vocabulary; it outperforms state-of-the-art methods on the ImageNet zero-shot learning task.
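The construction is a convex combination: embed an image as the probability-weighted average of the word vectors of its top-k predicted seen classes, then classify by nearest unseen-class vector. A numpy sketch with toy word vectors and a toy posterior (all names illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
d = 50
seen = ["dog", "cat", "horse"]
unseen = ["wolf", "tiger"]
wvec = {c: rng.standard_normal(d) for c in seen + unseen}  # toy word vectors

def conse_embed(probs, top_k=2):
    """Convex combination of the word vectors of the top-k seen classes,
    weighted by the classifier's (renormalised) probabilities."""
    idx = np.argsort(probs)[::-1][:top_k]
    w = probs[idx] / probs[idx].sum()        # weights sum to 1 (convex)
    return sum(wi * wvec[seen[i]] for wi, i in zip(w, idx))

probs = np.array([0.7, 0.2, 0.1])            # toy seen-class posterior p(y|x)
s = conse_embed(probs)

# Zero-shot prediction: nearest unseen class by cosine similarity.
def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

pred = max(unseen, key=lambda c: cos(s, wvec[c]))
```

Because the method needs only classifier outputs and word vectors, it requires no retraining when new unseen classes are added to the vocabulary.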