Syntactically Guided Generative Embeddings for Zero-Shot Skeleton Action Recognition

@article{Gupta2021SyntacticallyGG,
  title={Syntactically Guided Generative Embeddings for Zero-Shot Skeleton Action Recognition},
  author={Pranay Gupta and Divyanshu Sharma and Ravi Kiran Sarvadevabhatla},
  journal={ArXiv},
  year={2021},
  volume={abs/2101.11530}
}
We introduce SynSE, a novel syntactically guided generative approach for Zero-Shot Learning (ZSL). Our end-to-end approach learns progressively refined generative embedding spaces constrained within and across the involved modalities (visual, language). The inter-modal constraints are defined between action sequence embedding and embeddings of Parts of Speech (PoS) tagged words in the corresponding action description. We deploy SynSE for the task of skeleton-based action sequence recognition…
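
As an illustration of the inter-modal constraint described above (aligning the skeleton embedding with verb and noun embeddings from the PoS-tagged description), here is a minimal PyTorch sketch. All names, dimensions, and the MSE alignment loss are assumptions, not the authors' implementation:

```python
# Minimal sketch of PoS-guided alignment; hypothetical, not SynSE's code.
import torch
import torch.nn as nn

class PoSGuidedAlignment(nn.Module):
    """One projection head per PoS-specific embedding space (assumed design)."""
    def __init__(self, skel_dim=256, word_dim=300, latent_dim=100):
        super().__init__()
        # separate visual projections into the verb and noun spaces
        self.vis_to_verb = nn.Linear(skel_dim, latent_dim)
        self.vis_to_noun = nn.Linear(skel_dim, latent_dim)
        self.verb_proj = nn.Linear(word_dim, latent_dim)
        self.noun_proj = nn.Linear(word_dim, latent_dim)

    def forward(self, skel_emb, verb_emb, noun_emb):
        # inter-modal constraint: visual and language embeddings of the
        # same action should coincide within each PoS-specific space
        loss_verb = nn.functional.mse_loss(self.vis_to_verb(skel_emb),
                                           self.verb_proj(verb_emb))
        loss_noun = nn.functional.mse_loss(self.vis_to_noun(skel_emb),
                                           self.noun_proj(noun_emb))
        return loss_verb + loss_noun
```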

Citations

Multi-Modal Zero-Shot Sign Language Recognition
TLDR
This work proposes a multi-modal Zero-Shot Sign Language Recognition (ZS-SLR) model that harnesses the complementary capabilities of deep features fused with skeleton-based ones, and uses an Auto-Encoder on top of a Long Short-Term Memory (LSTM) network.
ZS-SLR: Zero-Shot Sign Language Recognition from RGB-D Videos
TLDR
This work formulates the problem of Zero-Shot Sign Language Recognition (ZS-SLR) and proposes a two-stream model over two input modalities, RGB and Depth videos, configuring a transformer encoder-decoder architecture as a fast and accurate human detection model to overcome the challenges of current human detection models.

References

Showing 1–10 of 27 references
Fine-Grained Action Retrieval Through Multiple Parts-of-Speech Embeddings
TLDR
This paper proposes to enrich the embedding by disentangling parts of speech (PoS) in the accompanying captions, building a separate multi-modal embedding space for each PoS tag; this enables learning specialised embedding spaces that offer multiple views of the same embedded entities.
Skeleton based Zero Shot Action Recognition in Joint Pose-Language Semantic Space
TLDR
This work presents a body-pose-based zero-shot action recognition network and demonstrates how its pose-language semantic space encodes knowledge that allows the model to correctly predict actions not seen during training.
Generalized Zero-Shot Learning via Aligned Variational Autoencoders
TLDR
This work proposes a model in which a shared latent space of image features and class embeddings is learned by aligned variational autoencoders for the purpose of generating latent features to train a softmax classifier, and establishes a new state of the art on generalized zero-shot learning.
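
The aligned-VAE idea above adds two alignment terms on top of the usual per-modality VAE objectives. The sketch below is a rough, assumed-name rendering (dec_img, dec_cls and the modality suffixes are hypothetical, and the distance is the 2-Wasserstein between diagonal Gaussians):

```python
import torch
import torch.nn.functional as F

def cross_alignment(dec_img, dec_cls, z_img, z_cls, x_img, x_cls):
    # decode each modality from the *other* modality's latent code
    return (F.l1_loss(dec_img(z_cls), x_img)
            + F.l1_loss(dec_cls(z_img), x_cls))

def distribution_alignment(mu_img, logvar_img, mu_cls, logvar_cls):
    # 2-Wasserstein distance between the two diagonal latent Gaussians
    std_img = (0.5 * logvar_img).exp()
    std_cls = (0.5 * logvar_cls).exp()
    return ((mu_img - mu_cls).pow(2).sum(1)
            + (std_img - std_cls).pow(2).sum(1)).sqrt().mean()
```
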
Learning Robust Visual-Semantic Embeddings
TLDR
An end-to-end learning framework that extracts more robust multi-modal representations across domains is presented, and a novel technique of unsupervised-data adaptation inference is introduced to construct more comprehensive embeddings for both labeled and unlabeled data.
Generalized Zero-Shot Learning via Synthesized Examples
TLDR
This work presents a generative framework for generalized zero-shot learning, where the training and test classes are not necessarily disjoint, which can generate novel exemplars from seen/unseen classes given their respective class attributes.
Learning to Compare: Relation Network for Few-Shot Learning
TLDR
A conceptually simple, flexible, and general framework for few-shot learning, where a classifier must learn to recognise new classes given only a few examples from each, and which is easily extended to zero-shot learning.
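
To make the relation-network idea above concrete, here is a minimal sketch of a learned relation module; shapes and layer sizes are illustrative assumptions:

```python
# The relation score between a query embedding and a class embedding is
# learned by a small network rather than fixed to a metric (sketch only).
import torch
import torch.nn as nn

class RelationModule(nn.Module):
    def __init__(self, feat_dim=64):
        super().__init__()
        self.g = nn.Sequential(
            nn.Linear(2 * feat_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, 1), nn.Sigmoid())  # relation score in [0, 1]

    def forward(self, query, support):
        # concatenate query and support embeddings, then score their relation
        return self.g(torch.cat([query, support], dim=-1))
```
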
Feature Generating Networks for Zero-Shot Learning
TLDR
A novel generative adversarial network (GAN) that synthesizes CNN features conditioned on class-level semantic information, offering a shortcut directly from a semantic descriptor of a class to a class-conditional feature distribution.
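
A rough sketch of the feature-generating idea above, with hypothetical dimensions (attribute, noise, and feature sizes are assumptions, and the adversarial training loop is omitted):

```python
# Conditional feature generator in the spirit of f-CLSWGAN (sketch only):
# noise + class embedding in, synthetic CNN feature out.
import torch
import torch.nn as nn

class FeatureGenerator(nn.Module):
    def __init__(self, attr_dim=85, noise_dim=85, feat_dim=2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(attr_dim + noise_dim, 4096),
            nn.LeakyReLU(0.2),
            nn.Linear(4096, feat_dim),
            nn.ReLU(),  # CNN features (post-ReLU) are non-negative
        )

    def forward(self, attrs, noise):
        return self.net(torch.cat([attrs, noise], dim=1))

# usage idea: synthesize features for an unseen class, then train a
# standard softmax classifier on them, e.g.
# feats = gen(unseen_attr.expand(n, -1), torch.randn(n, 85))
```
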
View Adaptive Neural Networks for High Performance Skeleton-Based Human Action Recognition
TLDR
A novel view adaptation scheme, which automatically determines the virtual observation viewpoints over the course of an action in a learning-based, data-driven manner, and a two-stream scheme (referred to as VA-fusion) that fuses the scores of the two networks to provide the final prediction, obtaining enhanced performance.
Learning Deep Representations of Fine-Grained Visual Descriptions
TLDR
This model achieves strong performance on zero-shot text-based image retrieval and significantly outperforms the attribute-based state of the art for zero-shot classification on the Caltech-UCSD Birds 200-2011 dataset.
Zero-Shot Learning by Convex Combination of Semantic Embeddings
TLDR
A simple method is proposed for constructing an image embedding system from any existing image classifier and a semantic word embedding model that contains the $n$ class labels in its vocabulary; it outperforms state-of-the-art methods on the ImageNet zero-shot learning task.
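
The convex-combination rule above is simple enough to sketch directly; the function below is an illustrative NumPy rendering, not the paper's code:

```python
# ConSE-style embedding (sketch): embed a test image as the
# classifier-probability-weighted combination of the word vectors of
# its top-T predicted seen classes.
import numpy as np

def conse_embed(class_probs, word_vecs, T=10):
    top = np.argsort(class_probs)[-T:]             # T most probable seen classes
    w = class_probs[top] / class_probs[top].sum()  # normalize: convex combination
    return w @ word_vecs[top]                      # weighted semantic embedding

# an unseen label is then predicted by nearest-neighbour search of this
# embedding among the word vectors of the unseen class names
```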