Corpus ID: 53633371

Action2Vec: A Crossmodal Embedding Approach to Action Learning

@article{Hahn2019Action2VecAC,
  title={Action2Vec: A Crossmodal Embedding Approach to Action Learning},
  author={Meera Hahn and Andrew Silva and James M. Rehg},
  journal={ArXiv},
  year={2019},
  volume={abs/1901.00484}
}
We describe a novel cross-modal embedding space for actions, named Action2Vec, which combines linguistic cues from class labels with spatio-temporal features derived from video clips. Key Method: We train our embedding using a joint loss that combines classification accuracy with similarity to Word2Vec semantics. We evaluate Action2Vec by performing zero-shot action recognition and obtain state-of-the-art results on three standard datasets. In addition, we present two novel analogy tests which quantify the…
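The joint loss described in the abstract pairs a standard classification term with a term that pulls the clip embedding toward the Word2Vec vector of its class label. Below is a minimal sketch of that idea, assuming a PyTorch setup; the variable names and the weighting `alpha` are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def joint_loss(video_emb, logits, labels, label_vecs, alpha=0.5):
    """Classification loss plus similarity to the Word2Vec label vector.

    video_emb : (B, D) clip embeddings from the video encoder
    logits    : (B, C) class scores predicted from video_emb
    labels    : (B,)   integer class indices
    label_vecs: (C, D) pre-computed Word2Vec embeddings, one per class
    alpha     : weight balancing the two terms (assumed value)
    """
    cls_loss = F.cross_entropy(logits, labels)
    # Pull each clip embedding toward its class's Word2Vec vector.
    targets = label_vecs[labels]                                  # (B, D)
    sim_loss = 1.0 - F.cosine_similarity(video_emb, targets, dim=1).mean()
    return alpha * cls_loss + (1.0 - alpha) * sim_loss
```

Zero-shot recognition then amounts to embedding a clip of an unseen class and retrieving the nearest unseen label vector in the same space.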

Citations

Fine-Grained Action Retrieval Through Multiple Parts-of-Speech Embeddings
TLDR
This paper proposes to enrich the embedding by disentangling parts-of-speech (PoS) in the accompanying captions, building a separate multi-modal embedding space for each PoS tag; this enables learning specialised embedding spaces that offer multiple views of the same embedded entities.
Cross-modal Representation Learning for Zero-shot Action Recognition
TLDR
Under a rigorous zero-shot setting with no pre-training on additional datasets, the experimental results show the model considerably improves upon the state of the art in ZSAR, reaching encouraging top-1 accuracy on the UCF101, HMDB51, and ActivityNet benchmark datasets.
Reformulating Zero-shot Action Recognition for Multi-label Actions
TLDR
This work proposes a ZSAR framework that does not rely on nearest neighbor classification but instead consists of a pairwise scoring function, which allows for the prediction of several semantically distinct classes within one video input.
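The pairwise scoring idea above can be read as scoring every (video, class-embedding) pair independently instead of taking a single nearest neighbour, which naturally supports multiple labels per clip. A rough sketch under that reading follows; the scoring MLP, its sizes, and the decision threshold are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class PairwiseScorer(nn.Module):
    """Scores a (video embedding, class embedding) pair with a small MLP."""
    def __init__(self, video_dim, text_dim, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(video_dim + text_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, video_emb, class_embs):
        # video_emb: (D_v,), class_embs: (C, D_t) -> one score per class
        v = video_emb.unsqueeze(0).expand(class_embs.size(0), -1)
        return self.mlp(torch.cat([v, class_embs], dim=1)).squeeze(1)

# Multi-label prediction: keep every class whose score clears a threshold
# rather than keeping only the single nearest neighbour.
# scores = scorer(video_emb, unseen_class_embs)
# predicted = (torch.sigmoid(scores) > 0.5).nonzero(as_tuple=True)[0]
```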
Developing Motion Code Embedding for Action Recognition in Videos
TLDR
A deep neural network model is developed and trained that combines visual and semantic features to identify the features of the authors' motion taxonomy, embedding or annotating videos with motion codes: a vectorized representation of motions based on a manipulation's salient mechanical attributes.
Unifying Few- and Zero-Shot Egocentric Action Recognition
TLDR
This work proposes a new set of splits derived from the EPIC-KITCHENS dataset that allow evaluation of open-set classification, and uses these splits to show that adding a metric-learning loss to the conventional direct-alignment baseline can improve zero-shot classification by as much as 10%, while not sacrificing few-shot performance.
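The improvement reported above comes from adding a metric-learning term on top of a direct visual-to-text alignment loss. A hedged sketch of how such a combination might look, assuming a triplet formulation; the margin and weighting are illustrative, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def alignment_plus_metric_loss(video_emb, text_emb, neg_text_emb,
                               margin=0.2, beta=0.5):
    """Direct alignment to the paired text embedding plus a triplet term
    that pushes away a non-matching class embedding.

    video_emb, text_emb, neg_text_emb: (B, D) embeddings in a shared space.
    margin, beta: illustrative hyper-parameters (assumptions).
    """
    # Direct alignment: minimise distance to the correct class embedding.
    align = F.mse_loss(video_emb, text_emb)
    # Metric-learning term: keep the correct class closer than a negative
    # class by at least the margin.
    triplet = F.triplet_margin_loss(video_emb, text_emb, neg_text_emb,
                                    margin=margin)
    return align + beta * triplet
```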
DASZL: Dynamic Action Signatures for Zero-shot Learning
TLDR
This paper presents an approach to fine-grained recognition that models activities as compositions of dynamic action signatures, and shows that off-the-shelf object detectors can be used to recognize activities in completely de-novo settings with no additional training.
SL-DML: Signal Level Deep Metric Learning for Multimodal One-Shot Action Recognition
TLDR
This work proposes a metric learning approach to reduce the action recognition problem to a nearest neighbor search in embedding space, which generalizes well in experiments on the UTD-MHAD dataset for inertial, skeleton and fused data and the Simitate dataset for motion capturing data.
All About Knowledge Graphs for Actions
TLDR
This work proposes a better understanding of knowledge graphs (KGs) that can be utilized for zero-shot and few-shot action recognition, along with an improved evaluation paradigm on the UCF101, HMDB51, and Charades datasets for knowledge transfer from models trained on Kinetics.
Reformulating Zero-shot Action Recognition for Multi-label Actions (Supplementary Material)
Since the AVA dataset consists of multiple actors within one video and ZSAR focuses only on the classification task, we extract clips centered on the ground-truth bounding boxes for each actor in the…
TNT: Text-Conditioned Network with Transductive Inference for Few-Shot Video Classification
TLDR
This paper formulates a text-based task conditioner to adapt video features to the few-shot learning task, and follows a transductive setting that improves the model's task adaptation by using the support textual descriptions and query instances to update a set of class prototypes.

References

Showing 1–10 of 41 references
Transductive Zero-Shot Action Recognition by Word-Vector Embedding
TLDR
This study constructs a mapping between visual features and a semantic descriptor of each action category, allowing new categories to be recognised in the absence of any visual training data, and achieves state-of-the-art zero-shot action recognition performance with a simple and efficient pipeline, without supervised annotation of attributes.
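The reference above follows the classic recipe of regressing visual features onto word-vector space and classifying unseen categories by nearest label vector. A minimal sketch of that pipeline is shown below; the function names are mine, and the regressor is a plain least-squares map rather than whatever the paper actually uses.

```python
import numpy as np

def fit_visual_to_semantic(X_seen, S_seen):
    """Least-squares map W from visual features to word-vector space.
    X_seen: (N, D_v) visual features of seen-class clips.
    S_seen: (N, D_s) Word2Vec vectors of their class labels."""
    W, *_ = np.linalg.lstsq(X_seen, S_seen, rcond=None)
    return W                                          # (D_v, D_s)

def zero_shot_predict(x, W, unseen_label_vecs):
    """Classify an unseen-class clip by the nearest label vector (cosine)."""
    s = x @ W
    s = s / (np.linalg.norm(s) + 1e-8)
    L = unseen_label_vecs / (np.linalg.norm(unseen_label_vecs,
                                             axis=1, keepdims=True) + 1e-8)
    return int(np.argmax(L @ s))
```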
Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models
TLDR
This work introduces the structure-content neural language model that disentangles the structure of a sentence from its content, conditioned on representations produced by the encoder, and shows that with linear encoders, the learned embedding space captures multimodal regularities in terms of vector space arithmetic.
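"Multimodal regularities in terms of vector space arithmetic" refers to analogy-style arithmetic on embeddings, the same property Action2Vec's analogy tests probe. A toy illustration of the arithmetic itself, assuming a dictionary of embeddings keyed by word; the helper and vocabulary are hypothetical.

```python
import numpy as np

def analogy(emb, a, b, c, exclude=True):
    """Return the entry d such that a : b :: c : d, via v_b - v_a + v_c."""
    query = emb[b] - emb[a] + emb[c]
    query /= np.linalg.norm(query) + 1e-8
    best, best_sim = None, -np.inf
    for word, vec in emb.items():
        if exclude and word in (a, b, c):
            continue
        sim = float(vec @ query / (np.linalg.norm(vec) + 1e-8))
        if sim > best_sim:
            best, best_sim = word, sim
    return best

# e.g. analogy(action_embeddings, "walk", "run", "swim") might retrieve a
# faster water action if the space encodes a consistent "speed" direction.
```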
Jointly Modeling Embedding and Translation to Bridge Video and Language
TLDR
A novel unified framework, named Long Short-Term Memory with visual-semantic Embedding (LSTM-E), which can simultaneously explore the learning of LSTM and visual-semantic embedding and outperforms several state-of-the-art techniques in predicting Subject-Verb-Object (SVO) triplets.
Multi-Task Zero-Shot Action Recognition with Prioritised Data Augmentation
TLDR
A visual-semantic mapping with better generalisation properties and a dynamic data re-weighting method to prioritise auxiliary data that are relevant to the target classes are introduced and applied to the challenging zero-shot action recognition problem.
Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset
TLDR
I3D models considerably improve upon the state-of-the-art in action classification, reaching 80.2% on HMDB-51 and 97.9% on UCF-101 after pre-training on Kinetics, and a new Two-Stream Inflated 3D ConvNet that is based on 2D ConvNet inflation is introduced.
Text2Shape: Generating Shapes from Natural Language by Learning Joint Embeddings
TLDR
The model combines and extends learning by association and metric learning approaches to learn implicit cross-modal connections, and produces a joint representation that captures the many-to-many relations between language and physical properties of 3D shapes such as color and shape.
Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning
TLDR
This paper proposes a new approach, namely the Hierarchical Recurrent Neural Encoder (HRNE), which exploits the temporal structure of videos over a longer range by reducing the length of the input information flow and compositing multiple consecutive inputs at a higher level.
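The HRNE idea of "reducing the length of the input information flow" can be pictured as a two-level recurrent encoder: a lower LSTM summarises short chunks of frames, and an upper LSTM runs over those chunk summaries. A rough PyTorch sketch of that structure follows; the dimensions and chunk size are placeholders, not the paper's configuration.

```python
import torch
import torch.nn as nn

class TwoLevelVideoEncoder(nn.Module):
    """Lower LSTM encodes fixed-size frame chunks; upper LSTM encodes the
    sequence of chunk summaries, shortening each recurrence length."""
    def __init__(self, feat_dim=2048, hidden=512, chunk=16):
        super().__init__()
        self.chunk = chunk
        self.lower = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.upper = nn.LSTM(hidden, hidden, batch_first=True)

    def forward(self, frames):                # frames: (B, T, feat_dim)
        B, T, D = frames.shape
        T = (T // self.chunk) * self.chunk    # assumes at least one full chunk
        chunks = frames[:, :T].reshape(B * (T // self.chunk), self.chunk, D)
        _, (h, _) = self.lower(chunks)        # h: (1, B * n_chunks, hidden)
        summaries = h.squeeze(0).reshape(B, T // self.chunk, -1)
        _, (h, _) = self.upper(summaries)
        return h.squeeze(0)                   # (B, hidden) video vector
```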
UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild
TLDR
This work introduces UCF101, which is currently the largest dataset of human actions, and provides baseline action recognition results on this new dataset using a standard bag-of-words approach with an overall performance of 44.5%.
Word2VisualVec: Image and Video to Sentence Matching by Visual Feature Prediction
TLDR
This paper contributes Word2VisualVec, a deep neural network architecture that learns to predict a deep visual encoding of textual input, based on sentence vectorization and a multi-layer perceptron, for image-to-sentence matching.
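Word2VisualVec goes in the opposite direction from most joint-embedding work: it maps a vectorised sentence into the visual feature space and matches there. A hedged sketch of that direction, assuming mean-pooled word vectors as the sentence vectorisation; the MLP sizes are placeholders.

```python
import torch
import torch.nn as nn

class TextToVisual(nn.Module):
    """Predicts a visual feature vector from a sentence embedding, so that
    matching happens in the visual space rather than a shared latent one."""
    def __init__(self, text_dim=300, visual_dim=2048, hidden=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(text_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, visual_dim),
        )

    def forward(self, sentence_vec):          # (B, text_dim)
        return self.net(sentence_vec)         # (B, visual_dim)

# Retrieval: rank images or videos by cosine similarity between their CNN
# features and the predicted visual vector of the query sentence.
```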
Learning Cross-Modal Embeddings for Cooking Recipes and Food Images
TLDR
This paper introduces Recipe1M, a new large-scale, structured corpus of over 1m cooking recipes and 800k food images, and demonstrates that regularization via the addition of a high-level classification objective both improves retrieval performance to rival that of humans and enables semantic vector arithmetic.