Transfer of Representations to Video Label Propagation: Implementation Factors Matter

Daniel W. McKee, Zitong Zhan, Bing Shuai, Davide Modolo, Joseph Tighe, Svetlana Lazebnik
This work studies feature representations for dense label propagation in video, with a focus on recently proposed methods that learn video correspondence using self-supervised signals such as colorization or temporal cycle consistency. In the literature, these methods have been evaluated with an array of inconsistent settings, making it difficult to discern trends or compare performance fairly. Starting with a unified formulation of the label propagation algorithm that encompasses most existing…


Self-supervised Learning for Video Correspondence Flow
A simple information bottleneck is introduced that forces the model to learn robust features for correspondence matching and prevents it from learning trivial solutions. The work also probes the upper bound by training on additional data, demonstrating further significant improvements on video segmentation.
Online Adaptation of Convolutional Neural Networks for Video Object Segmentation
Online Adaptive Video Object Segmentation (OnAVOS) is proposed, which updates the network online using training examples selected based on the network's confidence and the spatial configuration, and adds a pretraining step based on objectness, which is learned on PASCAL.
Dense Contrastive Learning for Self-Supervised Visual Pre-Training
DenseCL is presented, which implements self-supervised learning by optimizing a pairwise contrastive (dis)similarity loss at the pixel level between two views of input images and outperforms the state-of-the-art methods by a large margin.
Rethinking Self-supervised Correspondence Learning: A Video Frame-level Similarity Perspective
The hypothesis is that if the representation is good for recognition, it requires the convolutional features to find correspondence between similar objects or parts. VFS surpasses state-of-the-art self-supervised approaches for both OTB visual object tracking and DAVIS video object segmentation.
A Benchmark Dataset and Evaluation Methodology for Video Object Segmentation
This work presents a new benchmark dataset and evaluation methodology for the area of video object segmentation, named DAVIS (Densely Annotated VIdeo Segmentation), and provides a comprehensive analysis of several state-of-the-art segmentation approaches using three complementary metrics.
Revisiting Self-Supervised Visual Representation Learning
This study revisits numerous previously proposed self-supervised models, conducts a thorough large-scale study, and uncovers multiple crucial insights about standard recipes for CNN design that do not always translate to self-supervised representation learning.
Joint-task Self-supervised Learning for Temporal Correspondence
This method outperforms the state-of-the-art self-supervised methods on a variety of visual correspondence tasks, including video-object and part-segmentation propagation, keypoint tracking, and object tracking.
Unsupervised Representation Learning by Predicting Image Rotations
This work proposes to learn image features by training ConvNets to recognize the 2D rotation applied to the input image, and demonstrates both qualitatively and quantitatively that this apparently simple task provides a very powerful supervisory signal for semantic feature learning.
PReMVOS: Proposal-generation, Refinement and Merging for Video Object Segmentation
This work addresses semi-supervised video object segmentation, the task of automatically generating accurate and consistent pixel masks for objects in a video sequence, given the first-frame ground truth annotations, with the PReMVOS algorithm.
Associating Objects with Transformers for Video Object Segmentation
This paper investigates how to realize better and more efficient embedding learning for semi-supervised video object segmentation under challenging multi-object scenarios, and proposes an Associating Objects with Transformers (AOT) approach to match and decode multiple objects uniformly.