Corpus ID: 53633371

Action2Vec: A Crossmodal Embedding Approach to Action Learning

@article{Hahn2019Action2VecAC,
  title={Action2Vec: A Crossmodal Embedding Approach to Action Learning},
  author={Meera Hahn and Andrew Silva and James M. Rehg},
  journal={ArXiv},
  year={2019},
  volume={abs/1901.00484}
}
We describe a novel cross-modal embedding space for actions, named Action2Vec, which combines linguistic cues from class labels with spatio-temporal features derived from video clips. [...] Key Method: We train our embedding using a joint loss that combines classification accuracy with similarity to Word2Vec semantics. We evaluate Action2Vec by performing zero-shot action recognition and obtain state-of-the-art results on three standard datasets. In addition, we present two novel analogy tests which quantify the…
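A minimal sketch of such a joint objective, assuming a PyTorch setting: a cross-entropy classification term is mixed with a cosine-similarity term that pulls each clip embedding toward the Word2Vec vector of its label. The function name, the weighting alpha, and the tensor shapes are illustrative assumptions, not the paper's exact formulation.

```python
import torch.nn.functional as F

def joint_loss(clip_embeddings, logits, labels, word2vec_vectors, alpha=0.5):
    """clip_embeddings:  (B, D) video-clip embeddings from the visual encoder.
    logits:            (B, C) class scores predicted from those embeddings.
    labels:            (B,)   ground-truth class indices.
    word2vec_vectors:  (C, D) pretrained Word2Vec vectors for the class names.
    """
    # Classification term: keeps the embedding discriminative.
    ce = F.cross_entropy(logits, labels)
    # Semantic term: pull each clip embedding toward its label's Word2Vec vector.
    targets = word2vec_vectors[labels]                        # (B, D)
    cosine = F.cosine_similarity(clip_embeddings, targets, dim=1)
    semantic = (1.0 - cosine).mean()
    # Joint objective; the weighting alpha is an illustrative choice.
    return alpha * ce + (1.0 - alpha) * semantic
```
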
Citations

Fine-Grained Action Retrieval Through Multiple Parts-of-Speech Embeddings
This paper proposes to enrich the embedding by disentangling parts of speech (PoS) in the accompanying captions, building a separate multi-modal embedding space for each PoS tag; this enables learning specialised embedding spaces that offer multiple views of the same embedded entities.
Developing Motion Code Embedding for Action Recognition in Videos
A deep neural network model is developed and trained that combines visual and semantic features to identify the features in the authors' motion taxonomy, embedding or annotating videos with motion codes: a vectorized representation of motions based on a manipulation's salient mechanical attributes.
Unifying Few- and Zero-Shot Egocentric Action Recognition
This work proposes a new set of splits derived from the EPIC-KITCHENS dataset that allow evaluation of open-set classification, and uses these splits to show that adding a metric-learning loss to the conventional direct-alignment baseline can improve zero-shot classification by as much as 10%, while not sacrificing few-shot performance.
All About Knowledge Graphs for Actions
This work proposes a better understanding of knowledge graphs (KGs) that can be utilized for zero-shot and few-shot action recognition, along with an improved evaluation paradigm, based on the UCF101, HMDB51, and Charades datasets, for knowledge transfer from models trained on Kinetics.
Zero-shot Recognition of Complex Action Sequences
This work presents a framework for straightforward modeling of activities as a state machine of dynamic attributes and shows that encoding the temporal structure of attributes greatly increases the modeling power, allowing us to capture action direction, for example.
Learning Visual Actions Using Multiple Verb-Only Labels
It is demonstrated that multi-label verb-only representations outperform conventional single-verb labels, and that a multi-verb representation offers further benefits, including cross-dataset retrieval and retrieval of manner and result verb types.
Few-Shot Action Localization without Knowing Boundaries
This paper proposes a network that learns to estimate Temporal Similarity Matrices, which model a fine-grained similarity pattern between pairs of videos (trimmed or untrimmed), and uses them to generate Temporal Class Activation Maps (TCAMs) for seen or unseen classes, achieving performance comparable to or better than state-of-the-art fully-supervised few-shot learning methods.
Learning Video Models from Text: Zero-Shot Anticipation for Procedural Actions
A hierarchical model generalizes instructional knowledge from large-scale text corpora and transfers it to video, recognizing and predicting coherent and plausible actions multiple steps into the future, all in rich natural language.
Action Type induction from multilingual lexical features
This paper presents a vector representation and a clustering of action concepts based on lexical features extracted from IMAGACT, a multilingual and multimodal ontology of actions in which concepts…
Multi-Modal Zero-Shot Sign Language Recognition
This work proposes a multi-modal Zero-Shot Sign Language Recognition (ZS-SLR) model that harnesses the complementary capabilities of deep features fused with skeleton-based ones, and uses an Auto-Encoder on top of a Long Short-Term Memory (LSTM) network.

References

Showing 1–10 of 44 references
Semantic embedding space for zero-shot action recognition
This paper addresses zero-shot recognition in contemporary video action recognition tasks, using a semantic word vector space as the common space in which to embed videos and category labels, and demonstrates that a simple self-training and data augmentation strategy can significantly improve the efficacy of this mapping.
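To make this recipe concrete, here is a hedged NumPy sketch of zero-shot classification by nearest neighbour in the word-vector space; the regressor visual_to_semantic and the variable names are illustrative assumptions rather than the paper's code.

```python
import numpy as np

def zero_shot_classify(video_feature, visual_to_semantic, class_word_vectors, class_names):
    """video_feature:      (F,)   visual feature of a test clip.
    visual_to_semantic:  callable (F,) -> (D,), regressor trained on seen classes.
    class_word_vectors:  (C, D)  word vectors of the *unseen* class labels.
    class_names:         list of C label strings.
    """
    z = visual_to_semantic(video_feature)        # project the clip into word-vector space
    z = z / np.linalg.norm(z)
    w = class_word_vectors / np.linalg.norm(class_word_vectors, axis=1, keepdims=True)
    scores = w @ z                               # cosine similarity to each unseen label
    return class_names[int(np.argmax(scores))]
```
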
Transductive Zero-Shot Action Recognition by Word-Vector Embedding
This study constructs a mapping between visual features and a semantic descriptor of each action category, allowing new categories to be recognised in the absence of any visual training data, and achieves state-of-the-art zero-shot action recognition performance with a simple and efficient pipeline and without supervised annotation of attributes.
Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models
This work introduces a structure-content neural language model that disentangles the structure of a sentence from its content, conditioned on representations produced by the encoder, and shows that with linear encoders the learned embedding space captures multimodal regularities in terms of vector space arithmetic.
Jointly Modeling Embedding and Translation to Bridge Video and Language
A novel unified framework, named Long Short-Term Memory with visual-semantic Embedding (LSTM-E), can simultaneously explore the learning of LSTM and visual-semantic embedding, and outperforms several state-of-the-art techniques in predicting Subject-Verb-Object (SVO) triplets.
Multi-Task Zero-Shot Action Recognition with Prioritised Data Augmentation
A visual-semantic mapping with better generalisation properties and a dynamic data re-weighting method that prioritises auxiliary data relevant to the target classes are introduced and applied to the challenging zero-shot action recognition problem.
Two-Stream Convolutional Networks for Action Recognition in Videos
This work proposes a two-stream ConvNet architecture which incorporates spatial and temporal networks and demonstrates that a ConvNet trained on multi-frame dense optical flow is able to achieve very good performance in spite of limited training data.
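As a rough illustration of the two-stream idea summarised above, the sketch below fuses the class scores of a spatial (RGB) stream and a temporal (optical-flow) stream by late averaging of softmax probabilities; the backbone networks and input shapes are assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class TwoStream(nn.Module):
    """Late-fusion two-stream model: average the per-stream class probabilities."""
    def __init__(self, spatial_net: nn.Module, temporal_net: nn.Module):
        super().__init__()
        self.spatial_net = spatial_net      # sees a single RGB frame: (B, 3, H, W)
        self.temporal_net = temporal_net    # sees stacked optical flow: (B, 2*L, H, W)

    def forward(self, rgb_frame, flow_stack):
        p_spatial = torch.softmax(self.spatial_net(rgb_frame), dim=1)
        p_temporal = torch.softmax(self.temporal_net(flow_stack), dim=1)
        return (p_spatial + p_temporal) / 2     # fused class probabilities
```
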
Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset
A new Two-Stream Inflated 3D ConvNet (I3D), based on 2D ConvNet inflation, is introduced; I3D models considerably improve upon the state of the art in action classification, reaching 80.2% on HMDB-51 and 97.9% on UCF-101 after pre-training on Kinetics.
Text2Shape: Generating Shapes from Natural Language by Learning Joint Embeddings
The model combines and extends learning by association and metric learning approaches to learn implicit cross-modal connections, and produces a joint representation that captures the many-to-many relations between language and physical properties of 3D shapes such as color and shape.
Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning
This paper proposes a new approach, the Hierarchical Recurrent Neural Encoder (HRNE), which exploits video temporal structure over a longer range by reducing the length of the input information flow and compositing multiple consecutive inputs at a higher level.
UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild
This work introduces UCF101, currently the largest dataset of human actions, and provides baseline action recognition results on this new dataset using a standard bag-of-words approach, with an overall performance of 44.5%.