Learning Image Representations Tied to Ego-Motion

Dinesh Jayaraman and Kristen Grauman
2015 IEEE International Conference on Computer Vision (ICCV)
Understanding how images of objects and scenes behave in response to specific ego-motions is a crucial aspect of proper visual development, yet existing visual learning methods are conspicuously disconnected from the physical source of their images. We propose to exploit proprioceptive motor signals to provide unsupervised regularization in convolutional neural networks to learn visual representations from egocentric video. Specifically, we enforce that our learned features exhibit equivariance… 
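The equivariance idea in the abstract can be illustrated with a minimal sketch. It assumes, purely for illustration, that each ego-motion g is paired with a linear map M_g acting on feature vectors, so that the feature of the moved view z(gx) is approximately M_g z(x). The names `matvec`, `equivariance_loss`, and the toy 2-D features are hypothetical and are not taken from the paper.

```python
def matvec(M, z):
    """Multiply matrix M (list of rows) by vector z."""
    return [sum(m * v for m, v in zip(row, z)) for row in M]


def equivariance_loss(z_before, z_after, M_g):
    """Squared error between M_g * z(x) and z(g x).

    z_before: feature of the frame before the ego-motion (assumed given).
    z_after:  feature of the frame after the ego-motion (assumed given).
    M_g:      hypothetical linear map associated with ego-motion g.
    """
    predicted = matvec(M_g, z_before)
    return sum((p - a) ** 2 for p, a in zip(predicted, z_after))


# Toy example: the ego-motion rotates a 2-D feature by 90 degrees.
rotate_90 = [[0.0, -1.0], [1.0, 0.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
z_x = [1.0, 0.0]   # feature before the motion
z_gx = [0.0, 1.0]  # feature after the motion

loss_equivariant = equivariance_loss(z_x, z_gx, rotate_90)
loss_invariant = equivariance_loss(z_x, z_gx, identity)
```

In this toy setting the rotation map explains the feature change exactly, while the identity map (which corresponds to demanding invariance) does not. In a real system the features would come from a convolutional network and the maps would be learned; this sketch only shows the shape of the penalty.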

Learning Image Representations Tied to Egomotion from Unlabeled Video

This work proposes a new “embodied” visual learning paradigm, exploiting proprioceptive motor signals to train visual representations from egocentric video with no manual supervision, and shows that this unsupervised feature learning approach significantly outperforms previous approaches on visual recognition and next-best-view prediction tasks.

Learning Features by Watching Objects Move

Inspired by the human visual system, low-level motion-based grouping cues can be used to learn an effective visual representation that significantly outperforms previous unsupervised approaches across multiple settings, especially when training data for the target task is scarce.

Visuomotor Understanding for Representation Learning of Driving Scenes

This work leverages the large-scale unlabeled yet naturally paired data for visual representation learning in the driving scenario and demonstrates that the learned representation can benefit other tasks that require detailed scene understanding and outperforms competing unsupervised representations on semantic segmentation.

Unsupervised learning of image motion by recomposing sequences

It is demonstrated that a network trained using the unsupervised procedure on real-world sequences of human actions and vehicle motion can capture semantic regions corresponding to the motion in the scene, and not merely image-level differences, without requiring any motion labels.

Object-Centric Representation Learning from Unlabeled Videos

This work introduces a novel object-centric approach to temporal coherence that encourages similar representations to be learned for object-like regions segmented from nearby frames in a deep convolutional neural network representation.

Learning to Extract Motion from Videos in Convolutional Neural Networks

This paper shows how to extract dense optical flow from videos with a convolutional neural network (CNN) whose output is a distributed representation of motion, which allows it to represent multiple motions, transparent motions, and dynamic textures.

Understanding image motion with group representations

This work proposes a model of motion based on elementary group properties of transformations and uses it to train a representation of image motion that captures motion in both synthetic 2D sequences and real-world sequences of vehicle motion, without requiring any labels.

The Curious Robot: Learning Visual Representations via Physical Interactions

This work builds one of the first systems on a Baxter platform that pushes, pokes, grasps and observes objects in a tabletop environment, with each datapoint providing supervision to a shared ConvNet architecture allowing us to learn visual representations.

Unsupervised Learning by Predicting Noise

This paper introduces a generic framework for training deep networks end-to-end with no supervision: it fixes a set of target representations, called Noise As Targets (NAT), and constrains the deep features to align to them.



Deep Learning of Invariant Features via Simulated Fixations in Video

This work applies salient feature detection and tracking in videos to simulate fixations and smooth pursuit in human vision, and achieves a state-of-the-art recognition accuracy of 61% on the STL-10 dataset.

Learning to Relate Images

  • R. Memisevic
  • Computer Science
    IEEE Transactions on Pattern Analysis and Machine Intelligence
  • 2013
This paper reviews recent work on relational feature learning, analyzes the role that multiplicative interactions play in learning to encode relations, and discusses how square-pooling and complex cell models can be viewed as a way to represent multiplicative interactions and thereby as a way to encode relations.

Learning to See by Moving

It is found that using the same number of training images, features learnt using egomotion as supervision compare favourably to features learnt with class-label as supervision on the tasks of scene recognition, object recognition, visual odometry and keypoint matching.

Learning to Predict Gaze in Egocentric Video

A model for gaze prediction in egocentric video is presented that leverages the implicit cues in the camera wearer's behaviors and models the dynamic behavior of the gaze, in particular fixations, as latent variables to improve gaze prediction.

Moving Object Segmentation Using Motor Signals

This work presents a novel approach to detecting moving objects by clustering features into background and foreground according to their motion consistency with motor signals; it works entirely in 2D image space and does not involve any complex analysis or computation in 3D space.

Unsupervised Learning of Spatiotemporally Coherent Metrics

This work focuses on feature learning from unlabeled video data, using the assumption that adjacent video frames contain semantically similar information, and establishes a connection between slow feature learning and metric learning.
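The assumption that adjacent frames carry similar information is often written as a contrastive, metric-learning style penalty. The sketch below is one generic form of such a penalty and is not necessarily the formulation used in that work; the function names, the Euclidean distance, and the `margin` parameter are assumptions made for illustration.

```python
import math


def distance(z_a, z_b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(z_a, z_b)))


def coherence_loss(z_a, z_b, adjacent, margin=1.0):
    """Contrastive temporal-coherence penalty (illustrative form).

    Adjacent frames are pulled together by penalizing their distance.
    Non-adjacent frames are pushed apart until they are at least
    `margin` away from each other, after which the penalty is zero.
    """
    d = distance(z_a, z_b)
    if adjacent:
        return d
    return max(0.0, margin - d)


# Toy features for three frame pairs.
near_pair = coherence_loss([0.0, 0.0], [3.0, 4.0], adjacent=True)
far_pair = coherence_loss([0.0, 0.0], [3.0, 4.0], adjacent=False)
collapsed_pair = coherence_loss([0.0, 0.0], [0.0, 0.0], adjacent=False)
```

The hinge on non-adjacent pairs is what prevents the trivial solution in which every frame maps to the same feature vector.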

Attention Prediction in Egocentric Video Using Motion and Visual Saliency

The efficiency of the proposed framework was examined in real environments by using a head-mounted gaze tracker, and it was found that the egomotion-based attention maps contributed to accurately predicting human visual attention.

Watching Unlabeled Video Helps Learn New Human Actions from Very Few Labeled Snapshots

This work learns action categories from static images by leveraging prior observations of generic human motion to augment the training process, which improves a state-of-the-art technique when very few labeled training examples are available.

Video (language) modeling: a baseline for generative models of natural videos

For the first time, it is shown that a strong baseline model for unsupervised feature learning using video data can predict non-trivial motions over short video sequences.

Learning rotation-aware features: From invariant priors to equivariant descriptors

  • Uwe Schmidt, S. Roth
  • Computer Science
    2012 IEEE Conference on Computer Vision and Pattern Recognition
  • 2012
This paper describes a general framework for incorporating invariance to linear image transformations into product models for feature learning and shows the advantages of this approach in learning rotation-invariant image priors and in building rotation-equivariant and invariant descriptors of learned features.