SEMBED: Semantic Embedding of Egocentric Action Videos

Michael Wray, Davide Moltisanti, W. Mayol-Cuevas, Dima Damen
ECCV Workshops

We present SEMBED, an approach for embedding an egocentric object interaction video in a semantic-visual graph to estimate the probability distribution over its potential semantic labels. When object interactions are annotated using unbounded choice of verbs, we embrace the wealth and ambiguity of these labels by capturing the semantic relationships as well as the visual similarities over motion and appearance features. We show how SEMBED can interpret a challenging dataset of 1225 freely… 

Towards an Unequivocal Representation of Actions

This work deviates from single-verb labels and introduces a mapping between observations and multiple verb labels in order to create an Unequivocal Representation of Actions, which outperforms conventional single-verb labels on three egocentric datasets for both recognition and retrieval.

Trespassing the Boundaries: Labeling Temporal Bounds for Object Interactions in Egocentric Video

It is demonstrated that disagreement stems from a limited understanding of the distinct phases of an action, and annotating based on the Rubicon Boundaries, inspired by a similarly named cognitive model, is proposed for consistent temporal bounds of object interactions.

Learning Visual Actions Using Multiple Verb-Only Labels

It is demonstrated that multi-label verb-only representations outperform conventional single verb labels, and other benefits of a multi-verb representation are explored, including cross-dataset retrieval and retrieval by verb type (manner and result verb types).

How Shall We Evaluate Egocentric Action Recognition?

This work proposes a set of measures aimed at quantitatively and qualitatively assessing the performance of egocentric action recognition methods and investigates how frame-wise predictions can be turned into action-based temporal video segmentations.

An Action Is Worth Multiple Words: Handling Ambiguity in Action Recognition

This work addresses the challenge of training multi-label action recognition models from only single positive training labels by proposing two approaches that are based on generating pseudo training examples sampled from similar instances within the train set.

Multitask Learning to Improve Egocentric Action Recognition

This work considers learning the verbs and nouns of which action labels consist and predicting coordinates that capture the hand locations and the gaze-based visual saliency for all the frames of the input video segments, in order to tackle action recognition in egocentric videos.

Personal-location-based temporal segmentation of egocentric videos for lifelogging applications

Improving Classification by Improving Labelling: Introducing Probabilistic Multi-Label Object Interaction Recognition

This work models the mapping between observations and interaction classes, as well as class overlaps, towards a probabilistic multi-label classifier that emulates human annotators, and shows that learning from annotation probabilities outperforms majority voting and enables discovery of co-occurring labels.

First-Person Action Decomposition and Zero-Shot Learning

By constructing specialized features for the decomposed concepts, this method succeeds in zero-shot learning and outperforms previous results in conventional action recognition when the performance gaps of different features on verb/noun concepts are significant.

Object Detection-Based Location and Activity Classification from Egocentric Videos: A Systematic Analysis

It is determined that the recognition of activities is related to the presence of specific objects and that the lack of explicit associations between certain activities and objects hurts classification performance for these activities.



Delving into egocentric actions

A novel set of egocentric features is presented, it is shown how they can be combined with motion and object features, and a significant performance boost over all previous state-of-the-art methods is uncovered.

Learning to recognize objects in egocentric activities

The key to this approach is a robust, unsupervised bottom-up segmentation method, which exploits the structure of the egocentric domain to partition each frame into hand, object, and background categories and uses Multiple Instance Learning to match object instances across sequences.

Discovering important people and objects for egocentric video summarization

This work introduced novel egocentric features to train a regressor that predicts important regions and produces significantly more informative summaries than traditional methods that often include irrelevant or redundant information.

Egocentric Visual Event Classification with Location-Based Priors

The method tackles the challenge of a moving camera by creating deformable graph models for classification of actions and events captured from an egocentric point of view, and presents results on a dataset collected within a cluttered environment.

Learning to Recognize Daily Actions Using Gaze

An inference method is presented that can predict the best sequence of gaze locations and the associated action label from an input sequence of images and demonstrates improvements in action recognition rates and gaze prediction accuracy relative to state-of-the-art methods.

Object-Centric Spatio-Temporal Pyramids for Egocentric Activity Recognition

A boosting approach that automatically selects a small set of useful spatio-temporal pyramid histograms among a randomized pool of candidate partitions and an “object-centric” cutting scheme that prefers sampling bin boundaries near those objects prominently involved in the egocentric activities are proposed.

Going Deeper into First-Person Activity Recognition

By learning to recognize objects, actions and activities jointly, the performance of individual recognition tasks also increases by 30% (actions) and 14% (objects), and the results of extensive ablative analysis are included to highlight the importance of network design decisions.

YouTube2Text: Recognizing and Describing Arbitrary Activities Using Semantic Hierarchies and Zero-Shot Recognition

This paper presents a solution that takes a short video clip and outputs a brief sentence that sums up the main activity in the video, such as the actor, the action and its object, and uses a Web-scale language model to "fill in" novel verbs.

First Person Action Recognition Using Deep Learned Descriptors

This work proposes convolutional neural networks (CNNs) for end-to-end learning and classification of the wearer's actions and shows that the proposed network can generalize and give state-of-the-art performance on various disparate egocentric action datasets.

Fast unsupervised ego-action learning for first-person sports videos

This work addresses the novel task of discovering first-person action categories (which it calls ego-actions), which can be useful for tasks such as video indexing and retrieval, and investigates the use of motion-based histograms and unsupervised learning algorithms to quickly cluster video content.