Learning Video Object Segmentation From Unlabeled Videos

@article{Lu2020LearningVO,
  title={Learning Video Object Segmentation From Unlabeled Videos},
  author={Xiankai Lu and Wenguan Wang and Jianbing Shen and Yu-Wing Tai and David J. Crandall and Steven C. H. Hoi},
  journal={2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2020},
  pages={8957-8967}
}
We propose a new method for video object segmentation (VOS) that addresses object pattern learning from unlabeled videos, unlike most existing methods, which rely heavily on extensive annotated data. We introduce a unified unsupervised/weakly supervised learning framework, called MuG, that comprehensively captures intrinsic properties of VOS at multiple granularities. Our approach can help advance understanding of visual patterns in VOS and significantly reduce the annotation burden. With a…

Citations

Weakly Supervised Video Salient Object Detection
TLDR: An appearance-motion fusion module and a bidirectional ConvLSTM-based framework are proposed to achieve effective multi-modal learning and long-term temporal context modeling, yielding the first weakly supervised video salient object detection model based on relabeled fixation-guided scribble annotations.
Weakly Supervised Instance Segmentation for Videos with Temporal Mask Consistency
TLDR: This work introduces a new MaskConsist module, which addresses the problem of missing object instances by transferring stable predictions between neighboring frames during training, and proposes two ways to leverage an inter-pixel relation network (IRN) to effectively incorporate motion information during training.
Reducing the Annotation Effort for Video Object Segmentation Datasets
TLDR: A deep convolutional network is used to automatically create pseudo-labels on a pixel level from much cheaper bounding box annotations, and it is found that adding a manually annotated mask in only a single video frame for each object is sufficient to generate pseudo-labels which can be used to train a VOS method to reach almost the same performance level as when training with fully segmented videos.
Learning to Better Segment Objects from Unseen Classes with Unlabeled Videos
TLDR: This paper introduces a Bayesian method that is specifically designed to automatically create a high-quality training set which significantly boosts the performance of segmenting objects of unseen classes and could open the door for open-world instance segmentation using abundant Internet videos.
Mask Selection and Propagation for Unsupervised Video Object Segmentation
TLDR: This work introduces a novel idea of assessing mask quality using a neural network called Selector Net, trained in such a way that it generalizes across various datasets and is able to limit the noise accumulated along the video.
Self-supervised Segmentation via Background Inpainting
TLDR: This work introduces a self-supervised detection and segmentation approach that can work with single images captured by a potentially moving camera, and develops a Monte Carlo-based training strategy that allows the algorithm to explore the large space of object proposals.
Self-Supervised Video Object Segmentation by Motion-Aware Mask Propagation
TLDR: Evaluation on the DAVIS-2017 and YouTube-VOS datasets shows that MAMP achieves state-of-the-art performance with stronger generalization ability than existing self-supervised methods, i.e., 4.9% higher mean J&F on DAVIS-2017 and 4.85% higher mean J&F on the unseen categories of YouTube-VOS than the nearest competitor.
A Survey on Deep Learning Technique for Video Segmentation
TLDR: This survey comprehensively reviews two basic lines of research in this area, i.e., generic object segmentation in videos and video semantic segmentation, by introducing their respective task settings, background concepts, perceived need, development history, and main challenges.
Delving Deep Into Many-to-Many Attention for Few-Shot Video Object Segmentation
TLDR: A novel Domain Agent Network (DAN) is proposed, breaking down the full-rank attention into two smaller ones and allowing linear space and time complexity, as opposed to the original quadratic form, with no loss of performance.
Self-supervised Human Detection and Segmentation via Multi-view Consensus
TLDR: This work proposes using a multi-camera framework in which geometric constraints are embedded in the form of multi-view consistency during training, via coarse 3D localization in a voxel grid and fine-grained offset regression, and shows that this approach outperforms state-of-the-art self-supervised person detection and segmentation techniques on images that visually depart from those of standard benchmarks.

References

Showing 1-10 of 84 references
Weakly Supervised Learning of Object Segmentations from Web-Scale Video
TLDR: This work proposes to learn pixel-level segmentations of objects from weakly labeled (tagged) internet videos, automatically generating spatiotemporal masks for each object, such as "dog", without employing any pre-trained object detectors.
Learning Video Object Segmentation from Static Images
TLDR: It is demonstrated that highly accurate object segmentation in videos can be enabled by using a convolutional neural network (convnet) trained with static images only, and a combination of offline and online learning strategies is used.
Semantic object segmentation via detection in weakly labeled video
TLDR: In this paper, a novel video segmentation-by-detection framework is proposed, which first incorporates object and region detectors pre-trained on still images to generate a set of detection and segmentation proposals, which are then used to refine the object segmentation while preserving spatiotemporal consistency.
Semantic Co-segmentation in Videos
TLDR: This paper proposes to segment objects and understand their visual semantics from a collection of videos that link to each other, a task referred to as semantic co-segmentation, and utilizes a tracking-based approach to generate multiple object-like tracklets across the videos.
Online Adaptation of Convolutional Neural Networks for Video Object Segmentation
TLDR: Online Adaptive Video Object Segmentation (OnAVOS) is proposed, which updates the network online using training examples selected based on the confidence of the network and the spatial configuration, and adds a pretraining step based on objectness, which is learned on PASCAL.
Learning object class detectors from weakly annotated video
TLDR: It is shown that training from a combination of weakly annotated videos and fully annotated still images using domain adaptation improves the performance of a detector trained from still images alone.
Unsupervised Deep Tracking
TLDR: The proposed unsupervised tracker achieves the baseline accuracy of fully supervised trackers, which require complete and accurate labels during training, and exhibits potential for leveraging unlabeled or weakly labeled data to further improve tracking accuracy.
See More, Know More: Unsupervised Video Object Segmentation With Co-Attention Siamese Networks
TLDR: This work introduces a novel network, called the CO-attention Siamese Network (COSNet), to address the unsupervised video object segmentation task from a holistic view, and proposes a unified and end-to-end trainable framework in which different co-attention variants can be derived for mining the rich context within videos.
Bringing Background into the Foreground: Making All Classes Equal in Weakly-Supervised Video Semantic Segmentation
TLDR: This paper introduces an approach to weak supervision using only image tags by making use of classifier heatmaps, develops a two-stream deep architecture that jointly leverages appearance and motion, and designs a loss based on the authors' heatmaps to train it.
Fast Video Object Segmentation by Reference-Guided Mask Propagation
TLDR: A deep Siamese encoder-decoder network is proposed that is designed to take advantage of mask propagation and object detection while avoiding the weaknesses of both approaches, achieving accuracy competitive with state-of-the-art methods while running in a fraction of the time compared to others.