Compact CNN for indexing egocentric videos

  • Yair Poleg, Ariel Ephrat, Shmuel Peleg, Chetan Arora
  • Published 28 April 2015
  • Computer Science
  • 2016 IEEE Winter Conference on Applications of Computer Vision (WACV)
While egocentric video is becoming increasingly popular, browsing it is very difficult. In this paper we present a compact 3D Convolutional Neural Network (CNN) architecture for long-term activity recognition in egocentric videos. Recognizing long-term activities enables us to temporally segment (index) long and unstructured egocentric videos. Existing methods for this task are based on hand tuned features derived from visible objects, location of hands, as well as optical flow. Given a sparse… 
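To make the core idea concrete, here is a minimal sketch of a 3D convolution applied to a stacked spatio-temporal volume, the building block behind compact 3D-CNN architectures of this kind. This is an illustrative numpy implementation, not the paper's network: the input shapes, the single-channel flow volume, and the random kernel are all hypothetical stand-ins.

```python
import numpy as np

def conv3d_valid(volume, kernel):
    """Naive 'valid' 3D convolution (strictly, cross-correlation) of a
    single-channel spatio-temporal volume with one kernel."""
    T, H, W = volume.shape
    t, h, w = kernel.shape
    out = np.zeros((T - t + 1, H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            for k in range(out.shape[2]):
                out[i, j, k] = np.sum(volume[i:i+t, j:j+h, k:k+w] * kernel)
    return out

# Toy spatio-temporal input: 8 frames of optical-flow magnitudes on a
# 16x16 grid (a hypothetical stand-in for the paper's sparse flow input).
rng = np.random.default_rng(0)
flow_volume = rng.standard_normal((8, 16, 16))
kernel = rng.standard_normal((3, 3, 3))

features = conv3d_valid(flow_volume, kernel)
print(features.shape)  # (6, 14, 14)
```

Because the kernel spans the time axis as well as space, each output value mixes motion information across several consecutive frames, which is what lets 3D CNNs model long-term activities directly from stacked flow.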


Going Deeper into First-Person Activity Recognition

By learning to recognize objects, actions, and activities jointly, the performance of the individual recognition tasks also increases, by 30% (actions) and 14% (objects); results of an extensive ablative analysis are included to highlight the importance of network design decisions.

Egocentric video description based on temporally-linked sequences

Estimating Head Motion from Egocentric Vision

This paper estimates head (camera) motion from egocentric video, which can be further used to infer non-verbal behaviors such as head turns and nodding in multimodal interactions and suggests that CNNs do not directly learn useful visual features with end-to-end training from raw images alone.

Egocentric Video Summarization Based on People Interaction Using Deep Learning

An egocentric video summarization framework based on detecting important people in the video, using the AlexNet convolutional neural network to filter the key frames (frames where the camera wearer interacts closely with people).

Unsupervised Learning of Deep Feature Representation for Clustering Egocentric Actions

This work proposes a robust and generic unsupervised approach to first-person action clustering that surpasses supervised state-of-the-art accuracies without using action labels, and demonstrates that clustering of features leads to the discovery of semantically meaningful actions present in the video.

For Your Eyes Only: Learning to Summarize First-Person Videos

A unique network architecture for transferring spatiotemporal information across video domains is proposed, which jointly solves metric-learning based feature embedding and keyframe selection via Bidirectional Long Short-Term Memory (BiLSTM).

Generating 1 Minute Summaries of Day Long Egocentric Videos

This paper presents a novel unsupervised reinforcement learning technique to generate video summaries from day long egocentric videos and shows that the approach generates summaries focusing on social interactions, similar to the current state-of-the-art (SOTA).

Temporal Residual Networks for Dynamic Scene Recognition

A novel ConvNet architecture based on temporal residual units that is fully convolutional in spacetime; it boosts recognition performance and establishes a new state of the art on dynamic scene recognition, as well as on the complementary task of action recognition.

Pooled motion features for first-person videos

A representation framework based on time-series pooling, designed to keep track of how descriptor values change over time and summarize those changes to represent motion in the activity video.
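The pooling idea can be sketched in a few lines: per-frame descriptors are summarized over time, including how they change frame to frame. This is a simplified illustration; the paper's exact pooling operators and descriptor choices may differ.

```python
import numpy as np

def pool_time_series(descriptors):
    """Summarize a (T, D) sequence of per-frame descriptors over time,
    capturing both overall magnitude and frame-to-frame change."""
    descriptors = np.asarray(descriptors)
    deltas = np.diff(descriptors, axis=0)        # frame-to-frame change
    return np.concatenate([
        descriptors.max(axis=0),                 # max pooling over time
        descriptors.sum(axis=0),                 # sum pooling over time
        np.maximum(deltas, 0).sum(axis=0),       # total positive change
        np.maximum(-deltas, 0).sum(axis=0),      # total negative change
    ])

# Toy example: 5 frames of a 3-dimensional motion descriptor.
d = np.array([[0., 1., 2.],
              [1., 1., 1.],
              [2., 1., 0.],
              [1., 1., 1.],
              [0., 1., 2.]])
v = pool_time_series(d)
print(v.shape)  # (12,)
```

Splitting the temporal deltas into positive and negative parts is what distinguishes this from plain max/sum pooling: it preserves the direction of change, not just its extremes.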

C3D: Generic Features for Video Analysis

The Convolution 3D (C3D) feature is proposed: a generic spatio-temporal feature obtained by training a deep 3-dimensional convolutional network on a large annotated video dataset comprising objects, scenes, actions, and other frequently occurring concepts. The learned features encapsulate appearance and motion cues and perform well on different video classification tasks.

Temporal Segmentation of Egocentric Videos

This paper proposes a robust temporal segmentation of egocentric videos into a hierarchy of motion classes using new Cumulative Displacement Curves, and demonstrates the effectiveness of the approach on publicly available as well as choreographed videos.
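A rough sketch of the cumulative-displacement idea: integrating the instantaneous per-frame motion smooths out head-bob noise, so long-term motion classes (walking, standing, riding) become separable from the shape of the resulting curve. The toy flow field below is hypothetical, not the paper's data.

```python
import numpy as np

def cumulative_displacement(flow_x):
    """Cumulative sum of the mean horizontal flow per frame.
    `flow_x` is a (T, H, W) array of horizontal optical-flow values."""
    per_frame = np.asarray(flow_x).reshape(len(flow_x), -1).mean(axis=1)
    return np.cumsum(per_frame)

# Toy flow: 4 frames of 2x2 horizontal-flow fields.
flow = np.array([[[1., 1.], [1., 1.]],
                 [[2., 2.], [2., 2.]],
                 [[-1., -1.], [-1., -1.]],
                 [[0., 0.], [0., 0.]]])
curve = cumulative_displacement(flow)
print(curve)  # [1. 3. 2. 2.]
```

A steadily rising curve suggests sustained forward or lateral motion, while a flat curve indicates the wearer is stationary, which is the kind of cue a hierarchical segmentation can threshold on.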

Large-Scale Video Classification with Convolutional Neural Networks

This work studies multiple approaches for extending the connectivity of a CNN in the time domain to take advantage of local spatio-temporal information, and suggests a multiresolution, foveated architecture as a promising way of speeding up training.

Figure-ground segmentation improves handled object recognition in egocentric video

  • Xiaofeng Ren, Chunhui Gu
  • Computer Science
  • 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition
  • 2010
This work develops a bottom-up motion-based approach to robustly segment out foreground objects in egocentric video and shows that it greatly improves object recognition accuracy.

Two-Stream Convolutional Networks for Action Recognition in Videos

This work proposes a two-stream ConvNet architecture which incorporates spatial and temporal networks and demonstrates that a ConvNet trained on multi-frame dense optical flow is able to achieve very good performance in spite of limited training data.
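The two-stream scheme combines the class posteriors of the spatial (RGB) stream and the temporal (optical-flow) stream by late fusion. A minimal sketch of that fusion step, with hypothetical raw class scores standing in for real network outputs:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def late_fusion(spatial_scores, temporal_scores, w=0.5):
    """Weighted average of the class posteriors of a spatial (RGB)
    stream and a temporal (optical-flow) stream. `w` weights the
    spatial stream; w=0.5 is plain averaging."""
    return w * softmax(spatial_scores) + (1 - w) * softmax(temporal_scores)

# Hypothetical raw scores from each stream for 3 activity classes.
spatial = np.array([2.0, 1.0, 0.1])
temporal = np.array([0.5, 2.5, 0.2])
fused = late_fusion(spatial, temporal)
print(int(np.argmax(fused)))  # 1
```

Here the temporal stream's confidence in class 1 outweighs the spatial stream's preference for class 0, illustrating how motion evidence can override appearance when the two are fused.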

Egocentric recognition of handled objects: Benchmark and analysis

The quantitative evaluations show that the egocentric recognition of handled objects is a challenging but feasible problem with many unique characteristics and many opportunities for future research.

Fast unsupervised ego-action learning for first-person sports videos

This work addresses the novel task of discovering first-person action categories (called ego-actions), which can be useful for tasks such as video indexing and retrieval, and investigates the use of motion-based histograms and unsupervised learning algorithms to quickly cluster video content.

Learning Spatiotemporal Features with 3D Convolutional Networks

The learned features, namely C3D (Convolutional 3D), with a simple linear classifier outperform state-of-the-art methods on 4 different benchmarks and are comparable with current best methods on the other 2 benchmarks.

EgoSampling: Fast-forward and stereo for egocentric videos

This work proposes EgoSampling, an adaptive frame sampling that gives more stable fast-forwarded videos, and turns a drawback of head motion into a feature: stereo video can be created by sampling frames from the left-most and right-most head positions of each step, forming approximate stereo pairs.