SOS! Self-supervised Learning Over Sets Of Handled Objects In Egocentric Action Recognition

Victor Escorcia, Ricardo Guerrero, Xiatian Zhu, Brais Martínez

Learning an egocentric action recognition model from video data is challenging due to distractors (e.g., irrelevant objects) in the background. Integrating object information into the action model is hence beneficial. Existing methods often leverage a generic object detector to identify and represent the objects in the scene. However, several important issues remain. Object class annotations of good quality for the target domain (dataset) are still required for learning good object…

Interactive Prototype Learning for Egocentric Action Recognition

An end-to-end Interactive Prototype Learning (IPL) framework is proposed to learn better active-object representations by leveraging motion cues from the actor, and a set of verb prototypes is introduced to disentangle active-object features from distracting-object features.

Object Level Visual Reasoning in Videos

A model capable of learning to reason about semantically meaningful spatio-temporal interactions in videos is proposed, allowing it to capture detailed spatial interactions at a semantic, object-interaction-relevant level.

Attention is All We Need: Nailing Down Object-centric Attention for Egocentric Activity Recognition

An end-to-end trainable deep neural network model for egocentric activity recognition is proposed that surpasses, by up to +6 percentage points in recognition accuracy, the currently best-performing method, which relies on strong supervision from hand segmentation and object locations during training.

Symbiotic Attention with Privileged Information for Egocentric Action Recognition

This paper proposes a novel framework leveraging Symbiotic Attention with Privileged information (SAP) for egocentric video recognition, introducing a symbiotic attention (SA) mechanism to enable effective communication that achieves the state of the art on two large-scale egocentric video datasets.

Visual Compositional Learning for Human-Object Interaction Detection

A deep Visual Compositional Learning (VCL) framework is devised: a simple yet efficient framework for human-object interaction (HOI) detection that largely alleviates the long-tail distribution problem and benefits low-shot and zero-shot HOI detection.

Transitive Invariance for Self-Supervised Visual Representation Learning

This paper proposes to generate a graph with millions of objects mined from hundreds of thousands of videos, and argues for organizing and reasoning about the data with multiple variations, exploiting different self-supervised approaches to learn representations invariant to inter-instance variations.

LSTA: Long Short-Term Attention for Egocentric Action Recognition

This paper proposes LSTA as a mechanism to focus on features from spatially relevant parts while attention is tracked smoothly across the video sequence, achieving state-of-the-art performance on four standard benchmarks.

Detecting and Recognizing Human-Object Interactions

A novel model is proposed that learns to predict an action-specific density over target object locations based on the appearance of a detected person and efficiently infers interaction triplets in a clean, jointly trained end-to-end system the authors call InteractNet.

Demystifying Contrastive Self-Supervised Learning: Invariances, Augmentations and Dataset Biases

This work demonstrates that approaches like MoCo and PIRL learn occlusion-invariant representations but fail to capture viewpoint and category-instance invariance, which are crucial for object recognition, and proposes an approach that leverages unstructured videos to learn representations with higher viewpoint invariance.
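The contrastive objective underlying methods like MoCo and PIRL can be illustrated with a minimal InfoNCE-style loss: an anchor embedding is pulled toward a positive view of the same instance and pushed away from negatives. This is an illustrative NumPy sketch under our own naming, not code from either paper:

```python
import numpy as np

def info_nce_loss(anchor, positives, negatives, temperature=0.1):
    """Minimal InfoNCE contrastive loss for a single anchor.

    anchor:    (d,) embedding of one augmented view
    positives: (1, d) embedding of the other view of the same instance
    negatives: (k, d) embeddings of other instances (e.g., MoCo's queue)
    """
    def normalize(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    a = normalize(anchor)
    pos = normalize(positives)
    neg = normalize(negatives)

    # Cosine similarities of anchor vs. positive and negatives, scaled by temperature.
    logits = np.concatenate([pos @ a, neg @ a]) / temperature
    # Cross-entropy with the positive at index 0 as the target class.
    log_prob = logits - np.log(np.sum(np.exp(logits)))
    return -log_prob[0]
```

Minimizing this loss over many anchors makes the representation invariant to whatever transformations generate the positive pairs, which is why the choice of augmentations (occlusion, viewpoint, instance) determines which invariances are learned.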

First Person Action Recognition Using Deep Learned Descriptors

This work proposes convolutional neural networks (CNNs) for end-to-end learning and classification of the wearer's actions, and shows that the proposed network generalizes, giving state-of-the-art performance on several disparate egocentric action datasets.