Classifying, Segmenting, and Tracking Object Instances in Video with Mask Propagation

  • Gedas Bertasius, Lorenzo Torresani
  • Published 10 December 2019
  • Computer Science
  • 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
We introduce a method for simultaneously classifying, segmenting and tracking object instances in a video sequence. Our method, named MaskProp, adapts the popular Mask R-CNN to video by adding a mask propagation branch that propagates frame-level object instance masks from each video frame to all the other frames in a video clip. This allows our system to predict clip-level instance tracks with respect to the object instances segmented in the middle frame of the clip. Clip-level instance tracks… 
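To make the clip-level propagation idea concrete, here is a toy sketch (not the paper's actual network): instance masks predicted on the clip's middle frame are carried to every other frame, yielding one track per instance. The real MaskProp learns this propagation from features; the `motions` offsets below are a hypothetical stand-in for that learned motion.

```python
import numpy as np

def propagate_masks(mid_frame_masks, motions):
    """Toy stand-in for MaskProp's mask propagation branch.

    mid_frame_masks: (N, H, W) binary instance masks from the middle frame.
    motions: list of (dy, dx) integer offsets, one per frame in the clip,
             standing in for the learned per-frame motion of the real model.
    Returns: (T, N, H, W) clip-level instance tracks.
    """
    tracks = []
    for dy, dx in motions:
        # Shift every instance mask by this frame's offset ("propagation").
        shifted = np.roll(mid_frame_masks, shift=(dy, dx), axis=(1, 2))
        tracks.append(shifted)
    return np.stack(tracks)
```

Because every frame's masks are derived from the same middle-frame instances, the instance identity is preserved across the clip by construction, which is the key property the paper exploits for tracking.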

Figures and Tables from this paper

Citations
Video Instance Segmentation with a Propose-Reduce Paradigm
This work proposes a new paradigm, Propose-Reduce, which generates complete sequences for input videos in a single step, and builds a sequence propagation head on an existing image-level instance segmentation network for long-term propagation.
End-to-End Video Instance Segmentation with Transformers
A new video instance segmentation framework built upon Transformers, termed VisTR, which views the VIS task as a direct end-to-end parallel sequence decoding/prediction problem; it achieves the highest speed among all existing VIS models and the best result among single-model methods on the YouTube-VIS dataset.
MSN: Efficient Online Mask Selection Network for Video Instance Segmentation
This work presents a novel solution for Video Instance Segmentation (VIS) that automatically generates instance-level segmentation masks along with object classes and tracks them across a video using the Mask Selection Network (MSN).
Occluded Video Instance Segmentation
A large-scale dataset called OVIS for occluded video instance segmentation is collected, and to complement missing object cues caused by occlusion, a plug-and-play module called temporal feature calibration is proposed, built upon MaskTrack R-CNN and SipMask.
Weakly Supervised Instance Segmentation for Videos with Temporal Mask Consistency
This work introduces a new MaskConsist module, which addresses the problem of missing object instances by transferring stable predictions between neighboring frames during training, and proposes two ways to leverage the inter-pixel relation network (IRN) to effectively incorporate motion information during training.
Spatial Feature Calibration and Temporal Fusion for Effective One-stage Video Instance Segmentation
This work proposes a simple yet effective one-stage video instance segmentation framework based on spatial calibration and temporal fusion, namely STMask, which helps the framework handle challenging videos with motion blur, partial occlusion, and unusual object-to-camera poses.
SG-Net: Spatial Granularity Network for One-Stage Video Instance Segmentation
This work proposes a one-stage spatial granularity network (SG-Net) and presents state-of-the-art comparisons on the YouTube-VIS dataset, which could serve as a strong and flexible baseline for the VIS task.
VideoClick: Video Object Segmentation with a Single Click
This paper proposes a bottom-up approach in which, given a single click for each object in a video, the segmentation masks of those objects in the full video are obtained; this approach outperforms all baselines in this challenging setting.
Target-Aware Object Discovery and Association for Unsupervised Video Multi-Object Segmentation
This paper proposes to combine foreground region estimation and instance grouping together in one network, and additionally introduces temporal guidance for segmenting each frame, enabling more accurate object discovery; it outperforms state-of-the-art methods in both segmentation accuracy and inference speed.
1st Place Solution for YouTubeVOS Challenge 2021: Video Instance Segmentation
A unified model that mutually learns two modules, Temporally Correlated Instance Segmentation (TCIS) and Bidirectional Tracking (BiTrack), which exploit the temporal correlation of an object's instance masks across adjacent frames to compensate for data deficiency.

References

Learning Video Object Segmentation from Static Images
It is demonstrated that highly accurate object segmentation in videos can be enabled by using a convolutional neural network (convnet) trained with static images only, and a combination of offline and online learning strategies is used.
Video Instance Segmentation
This work extends the image instance segmentation problem to the video domain for the first time and proposes a novel algorithm, MaskTrack R-CNN, for the task of simultaneous detection, segmentation, and tracking of instances in videos.
PReMVOS: Proposal-generation, Refinement and Merging for Video Object Segmentation
This work addresses semi-supervised video object segmentation, the task of automatically generating accurate and consistent pixel masks for objects in a video sequence, given the first-frame ground truth annotations, with the PReMVOS algorithm.
UnOVOST: Unsupervised Offline Video Object Segmentation and Tracking
A novel tracklet-based Forest Path Cutting data-association algorithm that builds a decision forest of track hypotheses and then cuts this forest into paths forming long-term consistent object tracks; it performs competitively with many semi-supervised video object segmentation algorithms.
Flow-Guided Feature Aggregation for Video Object Detection
This work presents flow-guided feature aggregation, an accurate and end-to-end learning framework for video object detection that improves the per-frame features by aggregation of nearby features along the motion paths, and thus improves the video recognition accuracy.
Efficient Video Object Segmentation via Network Modulation
This work proposes a novel approach that uses a single forward pass to adapt the segmentation model to the appearance of a specific object and is 70× faster than fine-tuning approaches and achieves similar accuracy.
Video Instance Segmentation 2019: A Winning Approach for Combined Detection, Segmentation, Classification and Tracking.
This work divides VIS into four parts: detection, segmentation, tracking, and classification; it develops algorithms for performing each of these four sub-tasks individually and combines them into a complete solution for VIS.
Learning Video Object Segmentation with Visual Memory
A novel two-stream neural network with an explicit memory module to achieve the task of segmenting moving objects in unconstrained videos and provides an extensive ablative analysis to investigate the influence of each component in the proposed framework.
Blazingly Fast Video Object Segmentation with Pixel-Wise Metric Learning
The proposed method supports different kinds of user input such as segmentation mask in the first frame (semi-supervised scenario), or a sparse set of clicked points (interactive scenario), and reaches comparable quality to competing methods with much less interaction.
MOTS: Multi-Object Tracking and Segmentation
This paper creates dense pixel-level annotations for two existing tracking datasets using a semi-automatic annotation procedure, and proposes a new baseline method which jointly addresses detection, tracking, and segmentation with a single convolutional network.