Learning Image Representations Tied to Egomotion from Unlabeled Video

@article{Jayaraman2017LearningIR,
  title={Learning Image Representations Tied to Egomotion from Unlabeled Video},
  author={Dinesh Jayaraman and Kristen Grauman},
  journal={International Journal of Computer Vision},
  year={2017},
  volume={125},
  pages={136-161}
}
Understanding how images of objects and scenes behave in response to specific egomotions is a crucial aspect of proper visual development, yet existing visual learning methods are conspicuously disconnected from the physical source of their images. We propose a new “embodied” visual learning paradigm, exploiting proprioceptive motor signals to train visual representations from egocentric video with no manual supervision. Specifically, we enforce that our learned features exhibit equivariance, i.e., they respond predictably to transformations associated with distinct egomotions.
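As an illustration of the equivariance idea in the abstract, the following is a minimal sketch (not the authors' implementation) of how features of a frame pair related by a discrete egomotion class can be tied together by a learned linear map. The encoder architecture, feature dimension, number of motion classes, and all names (EquivariantFeatures, motion_maps, and so on) are hypothetical choices for the example.

import torch
import torch.nn as nn

# Minimal sketch of an equivariance-style regularizer (hypothetical names).
class EquivariantFeatures(nn.Module):
    def __init__(self, feat_dim=128, num_motion_classes=6):
        super().__init__()
        # Any CNN encoder producing feat_dim-dimensional features would do here.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )
        # One learned linear map per discrete egomotion class, initialized to identity.
        self.motion_maps = nn.Parameter(
            torch.eye(feat_dim).repeat(num_motion_classes, 1, 1))

    def equivariance_loss(self, frame_t, frame_t1, motion_class):
        z_t = self.encoder(frame_t)          # features before the egomotion
        z_t1 = self.encoder(frame_t1)        # features after the egomotion
        M = self.motion_maps[motion_class]   # (batch, d, d) map for each pair's motion
        z_pred = torch.bmm(M, z_t.unsqueeze(-1)).squeeze(-1)
        # Equivariance: M_g z(x_t) should predict z(x_t+1) for pairs related by motion g.
        return ((z_pred - z_t1) ** 2).sum(dim=1).mean()

In this sketch the motion class would come from the agent's own proprioceptive motor signal (e.g., a discretized turn or forward translation), so no manual labels are needed for this term.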
ShapeCodes: Self-supervised Feature Learning by Lifting Views to Viewgrids
TLDR
An unsupervised feature learning approach that embeds 3D shape information into a single-view image representation; it successfully learns to perform “mental rotation” even for objects unseen during training, and the learned latent space is a powerful representation for object recognition, outperforming several existing unsupervised feature learning methods.
Learning Correspondence From the Cycle-Consistency of Time
TLDR
A self-supervised method that uses cycle-consistency in time as a free supervisory signal for learning visual representations from scratch, and demonstrates the generalizability of the representation -- without finetuning -- across a range of visual correspondence tasks, including video object segmentation, keypoint tracking, and optical flow.
Cross Pixel Optical Flow Similarity for Self-Supervised Learning
TLDR
This work uses motion cues in the form of optical flow to supervise representations of static images, and achieves state-of-the-art results in self-supervision using motion cues, competitive results for self-supervision in general, and is overall state of the art in self-supervised pretraining for semantic image segmentation.
View Synthesis by Appearance Flow
TLDR
This work addresses the problem of novel view synthesis: given an input image, synthesizing new images of the same object or scene observed from arbitrary viewpoints. It shows that for both objects and scenes, this approach is able to synthesize novel views of higher perceptual quality than previous CNN-based techniques.
Self-Supervised Representation Learning From Videos for Facial Action Unit Detection
TLDR
Experimental results demonstrate that the learned representation is discriminative for AU detection, where TCAE outperforms or is comparable with the state-of-the-art self-supervised learning methods and supervised AU detection methods.
Exploit Clues From Views: Self-Supervised and Regularized Learning for Multiview Object Recognition
TLDR
Experiments show that the recognition and retrieval results using VISPE outperform those of other self-supervised learning methods on seen and unseen data.
Visual Learning Beyond Direct Supervision
TLDR
This thesis proposes alternative methods of supervised learning that do not require direct labels and shows that this kind of “meta-supervision” on how the output behaves, rather than what it is, turns out to be surprisingly effective in learning a variety of vision tasks.
Video Cloze Procedure for Self-Supervised Spatio-Temporal Learning
TLDR
With VCP, a novel self-supervised method for learning rich spatio-temporal representation models (3D-CNNs) and applying them to action recognition and video retrieval tasks, the trained models outperform state-of-the-art self-supervised models by significant margins.
Unsupervised Feature Learning via Non-Parametric Instance-level Discrimination
TLDR
This work formulates this intuition as a non-parametric classification problem at the instance level, and uses noise-contrastive estimation to tackle the computational challenges imposed by the large number of instance classes.
Learning View and Target Invariant Visual Servoing for Navigation
  • Yimeng Li, J. Kosecka
  • Computer Science
    2020 IEEE International Conference on Robotics and Automation (ICRA)
  • 2020
TLDR
This paper proposes a new architecture for local mobile robot navigation which overcomes the brittleness of classical visual servoing based methods and achieves significantly higher generalization capability compared to the previous learning approaches.

References

SHOWING 1-10 OF 62 REFERENCES
Learning Image Representations Tied to Ego-Motion
TLDR
This work proposes to exploit proprioceptive motor signals to provide unsupervised regularization in convolutional neural networks learning visual representations from egocentric video, enforcing that the learned features exhibit equivariance, i.e., they respond predictably to transformations associated with distinct ego-motions.
Object-Centric Representation Learning from Unlabeled Videos
TLDR
This work introduces a novel object-centric approach to temporal coherence that encourages similar representations to be learned for object-like regions segmented from nearby frames in a deep convolutional neural network representation.
Deep Learning of Invariant Features via Simulated Fixations in Video
TLDR
This work applies salient feature detection and tracking in videos to simulate fixations and smooth pursuit in human vision, and achieves a state-of-the-art recognition accuracy of 61% on the STL-10 dataset.
Learning to See by Moving
TLDR
It is found that, using the same number of training images, features learnt using egomotion as supervision compare favourably to features learnt with class labels as supervision on the tasks of scene recognition, object recognition, visual odometry, and keypoint matching.
Unsupervised Learning of Visual Representations Using Videos
  • X. Wang, A. Gupta
  • Computer Science
    2015 IEEE International Conference on Computer Vision (ICCV)
  • 2015
TLDR
A simple yet surprisingly powerful approach for unsupervised learning of a CNN that uses hundreds of thousands of unlabeled videos from the web to learn visual representations, and designs a Siamese-triplet network with a ranking loss function to train this CNN representation.
Learning to Relate Images
  • R. Memisevic
  • Computer Science
    IEEE Transactions on Pattern Analysis and Machine Intelligence
  • 2013
TLDR
This paper reviews the recent work on relational feature learning, provides an analysis of the role that multiplicative interactions play in learning to encode relations, and discusses how square-pooling and complex cell models can be viewed as a way to represent multiplicative interactions and thereby as a way to encode relations.
Watching Unlabeled Video Helps Learn New Human Actions from Very Few Labeled Snapshots
TLDR
An approach to learn action categories from static images that leverages prior observations of generic human motion to augment its training process, enhancing a state-of-the-art technique when very few labeled training examples are available.
Learning to Predict Gaze in Egocentric Video
TLDR
A model for gaze prediction in egocentric video is presented, leveraging the implicit cues in the camera wearer's behaviors and modeling the dynamic behavior of the gaze, in particular fixations, as latent variables to improve the gaze prediction.
Slow and Steady Feature Analysis: Higher Order Temporal Coherence in Video
TLDR
A convolutional neural network is trained with a regularizer on tuples of sequential frames from unlabeled video to generalize slow feature analysis to "steady" feature analysis, imposing a prior that higher-order derivatives in the learned feature space must be small.
Video (language) modeling: a baseline for generative models of natural videos
TLDR
For the first time, it is shown that a strong baseline model for unsupervised feature learning using video data can predict non-trivial motions over short video sequences.