Spatial-Aware Object Embeddings for Zero-Shot Localization and Classification of Actions

@article{Mettes2017SpatialAwareOE,
  title={Spatial-Aware Object Embeddings for Zero-Shot Localization and Classification of Actions},
  author={Pascal Mettes and Cees G. M. Snoek},
  journal={2017 IEEE International Conference on Computer Vision (ICCV)},
  year={2017},
  pages={4453-4462}
}
  • P. Mettes, Cees G. M. Snoek
  • Published 28 July 2017
  • Computer Science
  • 2017 IEEE International Conference on Computer Vision (ICCV)
We aim for zero-shot localization and classification of human actions in video. Where traditional approaches rely on global attribute or object classification scores for their zero-shot knowledge transfer, our main contribution is a spatial-aware object embedding. To arrive at spatial awareness, we build our embedding on top of freely available actor and object detectors. Relevance of objects is determined in a word embedding space and further enforced with estimated spatial preferences… 
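The abstract describes scoring actions by combining actor and object detections, word-embedding relevance between object names and the unseen action name, and spatial preferences between actor and object. Below is a minimal sketch of how such a spatial-aware score could be assembled; the cosine relevance, the Gaussian spatial prior, and all function names are illustrative assumptions, not the paper's exact formulation.

import numpy as np

def cosine(u, v):
    # Cosine similarity between two word-embedding vectors.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def box_center(box):
    # Center (x, y) of an [x1, y1, x2, y2] box.
    x1, y1, x2, y2 = box
    return np.array([(x1 + x2) / 2.0, (y1 + y2) / 2.0])

def spatial_prior(actor_box, object_box, bandwidth=0.25):
    # Soft preference for objects that appear close to the actor, with the
    # distance normalized by the actor-box diagonal (assumed Gaussian form).
    d = np.linalg.norm(box_center(actor_box) - box_center(object_box))
    diag = np.linalg.norm([actor_box[2] - actor_box[0], actor_box[3] - actor_box[1]])
    return float(np.exp(-((d / (diag + 1e-12)) ** 2) / (2.0 * bandwidth ** 2)))

def actor_score(action_vec, actor_box, object_dets, object_vecs):
    # Score an actor box for an unseen action by aggregating detections of
    # semantically relevant objects, weighted by their spatial preference.
    # object_dets: list of (object_name, detection_score, [x1, y1, x2, y2])
    # object_vecs: dict mapping object_name -> word-embedding vector
    score = 0.0
    for name, det_score, box in object_dets:
        relevance = max(0.0, cosine(action_vec, object_vecs[name]))
        score += relevance * det_score * spatial_prior(actor_box, box)
    return score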

Citations

Object Priors for Classifying and Localizing Unseen Actions

It is found that persons and objects have preferred spatial relations that benefit unseen action localization, while using multiple languages and simple object filtering directly improves semantic matching, leading to state-of-the-art results for both unseen action classification and localization.

Guess Where? Actor-Supervision for Spatiotemporal Action Localization

Global Semantic Descriptors for Zero-Shot Action Recognition

This work introduces a new ZSAR method based on the relationships between actions and objects and between actions and descriptive sentences, demonstrating that representing all object classes with descriptive sentences yields an accurate object-action affinity estimation when a paraphrase-estimation method is used as the embedder.

Universal Prototype Transport for Zero-Shot Action Recognition and Localization

Empirically, it is shown that universal prototype transport diminishes the biased selection of unseen action prototypes and boosts both universal action and object models, resulting in state-of-the-art performance for zero-shot classification and spatio-temporal localization.

Pointly-Supervised Action Localization

It is concluded that points provide a viable alternative to boxes for action localization: the approach is as effective as traditional box supervision at a fraction of the annotation cost, is robust to sparse and noisy point annotations, benefits from pseudo-points during inference, and outperforms recent weakly-supervised alternatives.

Spatio-Temporal Instance Learning: Action Tubes from Class Supervision

This work proposes Spatio-Temporal Instance Learning, which enables action localization directly from box proposals in video frames; it outlines the assumptions of the model and introduces a max-margin objective with latent-variable optimization that enables spatio-temporal learning of actions from video labels alone.

Domain-Specific Priors and Meta Learning for Low-shot First-Person Action Recognition

This work develops an effective method for few-shot transfer learning for first-person action classification using independently trained local visual cues to learn representations that can be transferred from a source domain, which provides primitive action labels, to a different target domain.

Zero-Shot Action Recognition from Diverse Object-Scene Compositions

This paper proposes to construct the Cartesian product of all possible object and scene compositions, and outlines how to determine the likelihood of object-scene compositions in videos, as well as a semantic matching from object-scene compositions to actions that enforces diversity among the most relevant compositions for each action.
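As a rough illustration of the composition idea, the sketch below enumerates the Cartesian product of per-video object and scene probabilities and matches compositions to an action in word-embedding space; the product likelihood, the averaged similarity, and the simple top-k selection (standing in for the diversity constraint) are simplifying assumptions.

import itertools
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def composition_likelihoods(object_probs, scene_probs):
    # Likelihood of every (object, scene) composition in a video, taken here
    # as the product of the per-video object and scene probabilities.
    return {(o, s): po * ps
            for (o, po), (s, ps) in itertools.product(object_probs.items(),
                                                      scene_probs.items())}

def action_score(action_vec, compositions, object_vecs, scene_vecs, top_k=5):
    # Semantic match of each composition to the action, keeping only the
    # top-k compositions (a stand-in for the diversity constraint).
    matched = sorted(
        (lik * 0.5 * (cosine(action_vec, object_vecs[o]) + cosine(action_vec, scene_vecs[s]))
         for (o, s), lik in compositions.items()),
        reverse=True)
    return sum(matched[:top_k])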

All About Knowledge Graphs for Actions

This work proposes a better understanding of knowledge graphs (KGs) that can be utilized for zero-shot and few-shot action recognition, together with an improved evaluation paradigm, based on the UCF101, HMDB51, and Charades datasets, for knowledge transfer from models trained on Kinetics.
...

References

SHOWING 1-10 OF 65 REFERENCES

Objects2action: Classifying and Localizing Actions without Any Video Example

Objects2action is a semantic word embedding spanned by a skip-gram model of thousands of object categories; the work proposes a mechanism to exploit multiple-word descriptions of actions and objects and demonstrates how to extend the zero-shot approach to the spatio-temporal localization of actions in video.
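A minimal sketch of the transfer idea, assuming averaged word vectors for multi-word names and cosine similarity for the semantic matching (the paper's actual embedding and aggregation mechanism may differ):

import numpy as np

def phrase_vector(phrase, word_vecs):
    # Average the vectors of the words in a multi-word description.
    vecs = [word_vecs[w] for w in phrase.lower().split() if w in word_vecs]
    if not vecs:
        raise KeyError("no known words in phrase: %s" % phrase)
    return np.mean(vecs, axis=0)

def zero_shot_action_score(action_phrase, object_scores, word_vecs, top_k=10):
    # Score a video for an unseen action from object classification scores only,
    # transferring via the top-k most semantically related objects.
    a = phrase_vector(action_phrase, word_vecs)
    sims = {}
    for obj in object_scores:
        o = phrase_vector(obj, word_vecs)
        sims[obj] = float(np.dot(a, o) / (np.linalg.norm(a) * np.linalg.norm(o) + 1e-12))
    top = sorted(sims, key=sims.get, reverse=True)[:top_k]
    return sum(sims[o] * object_scores[o] for o in top)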

Spot On: Action Localization from Pointly-Supervised Proposals

An overlap measure between action proposals and points is introduced and incorporated into the objective of a non-convex Multiple Instance Learning optimization; experiments show that the approach is competitive with the state of the art.
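The overlap measure itself is not spelled out here; a hypothetical version that rewards proposals covering many annotated points while discounting overly large boxes might look as follows (the area discount and its normalization constant are assumptions).

def point_overlap(box, points):
    # Fraction of annotated (x, y) points inside an [x1, y1, x2, y2] box,
    # discounted so that very large boxes are not trivially rewarded.
    x1, y1, x2, y2 = box
    if not points:
        return 0.0
    inside = sum(1 for (px, py) in points if x1 <= px <= x2 and y1 <= py <= y2)
    coverage = inside / len(points)
    area_discount = 1.0 / (1.0 + (x2 - x1) * (y2 - y1) / 1e5)
    return coverage * area_discount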

Semantic embedding space for zero-shot action recognition

This paper addresses zero-shot recognition in contemporary video action recognition tasks, using semantic word vector space as the common space to embed videos and category labels, and demonstrates that a simple self-training and data augmentation strategy can significantly improve the efficacy of this mapping.
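A common way to realize such a shared space is a linear map from video features to label word vectors learned on seen classes, with unseen classes assigned by nearest label embedding. The ridge-regression sketch below illustrates this and is an assumption rather than the paper's exact mapping; the self-training and data-augmentation steps are omitted.

import numpy as np

def fit_visual_to_semantic(X_seen, Y_seen, lam=1.0):
    # Ridge regression: W minimizes ||X_seen W - Y_seen||^2 + lam ||W||^2.
    d = X_seen.shape[1]
    return np.linalg.solve(X_seen.T @ X_seen + lam * np.eye(d), X_seen.T @ Y_seen)

def predict_unseen(X_test, W, unseen_label_vecs):
    # Project test videos into the semantic space and pick the nearest
    # (most cosine-similar) unseen action word vector.
    Z = X_test @ W
    names = list(unseen_label_vecs)
    L = np.stack([unseen_label_vecs[n] for n in names])
    Zn = Z / (np.linalg.norm(Z, axis=1, keepdims=True) + 1e-12)
    Ln = L / (np.linalg.norm(L, axis=1, keepdims=True) + 1e-12)
    return [names[i] for i in (Zn @ Ln.T).argmax(axis=1)]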

What do 15,000 object categories tell us about classifying and localizing actions?

It is shown that objects matter for actions and are often semantically relevant as well, and it is revealed that object-action relations are generic, which allows these relationships to be transferred from one domain to the other.

Localizing Actions from Video Labels and Pseudo-Annotations

An intuitive and effective algorithm is proposed that localizes actions from their class label only, and pseudo-annotations can be leveraged during testing to improve weakly- and strongly-supervised localizers.

Recognizing unseen actions in a domain-adapted embedding space

This paper proposes a deep two-output model for video ZSL and action recognition by computing both spatial and temporal features from video content through distinct Convolutional Neural Networks (CNNs) and training a Multi-Layer Perceptron (MLP) on the extracted features to map videos to semantic word-embedding vectors.

Transductive Zero-Shot Action Recognition by Word-Vector Embedding

This study constructs a mapping between visual features and a semantic descriptor of each action category, allowing new categories to be recognised in the absence of any visual training data, and achieves the state-of-the-art zero-shot action recognition performance with a simple and efficient pipeline, and without supervised annotation of attributes.

Learning to Track for Spatio-Temporal Action Localization

The approach first detects proposals at the frame level and scores them with a combination of static and motion CNN features, then tracks high-scoring proposals throughout the video using a tracking-by-detection approach; it outperforms the state of the art by margins of 15%, 7%, and 12% mAP on its three benchmarks.

Improving bag-of-features action recognition with non-local cues

This work decomposes video into region classes and augments local features with corresponding region-class labels, demonstrating how this information can be integrated with BoF representations in a kernel-combination framework.
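As an illustration of the kernel-combination idea, the sketch below mixes a bag-of-features kernel with a region-class-label kernel as a weighted sum; the histogram-intersection kernel and the fixed mixing weight are assumptions, not necessarily the paper's choices.

import numpy as np

def histogram_intersection(H1, H2):
    # Kernel matrix of histogram intersections between two sets of histograms.
    return np.array([[np.minimum(h1, h2).sum() for h2 in H2] for h1 in H1])

def combined_kernel(bof_a, bof_b, region_a, region_b, alpha=0.5):
    # Weighted combination of the local-feature kernel and the region-label kernel.
    return (alpha * histogram_intersection(bof_a, bof_b)
            + (1.0 - alpha) * histogram_intersection(region_a, region_b))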
...