One-stage Video Instance Segmentation: From Frame-in Frame-out to Clip-in Clip-out

@article{Li2022OnestageVI,
  title={One-stage Video Instance Segmentation: From Frame-in Frame-out to Clip-in Clip-out},
  author={Minghan Li and Lei Zhang},
  journal={ArXiv},
  year={2022},
  volume={abs/2203.06421}
}
Many video instance segmentation (VIS) methods partition a video sequence into individual frames to detect and segment objects frame by frame. However, such a frame-in frame-out (FiFo) pipeline cannot effectively exploit temporal information. Based on the fact that adjacent frames in a short clip are highly coherent in content, we propose to extend the one-stage FiFo framework to a clip-in clip-out (CiCo) one, which performs VIS clip by clip. Specifically, we stack FPN features of all frames… 
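The abstract's core idea of stacking per-frame FPN features into a clip-level representation can be illustrated with a minimal sketch. The shapes, the function name, and the choice of stacking along the channel axis are illustrative assumptions here, not the paper's exact implementation:

```python
import numpy as np

def stack_clip_fpn_features(frame_features):
    """Stack per-frame FPN features of one pyramid level into a clip-level feature.

    frame_features: list of T arrays, each of shape (C, H, W), one per frame.
    Returns an array of shape (T * C, H, W) that downstream heads can process
    as a single clip, rather than frame by frame (FiFo).
    """
    return np.concatenate(frame_features, axis=0)

# Toy example: a 3-frame clip with 256-channel FPN features at one level.
clip = [np.random.rand(256, 16, 16).astype(np.float32) for _ in range(3)]
clip_feat = stack_clip_fpn_features(clip)
print(clip_feat.shape)  # (768, 16, 16)
```

In this sketch the temporal dimension is simply folded into the channels, so any head operating on the stacked tensor sees all frames of the clip at once.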


References

Showing 1–10 of 30 references

Classifying, Segmenting, and Tracking Object Instances in Video with Mask Propagation

The method, named MaskProp, adapts the popular Mask R-CNN to video by adding a mask propagation branch that propagates frame-level object instance masks from each video frame to all other frames in a video clip, predicting clip-level instance tracks with respect to the object instances segmented in the clip's middle frame.

End-to-End Video Instance Segmentation with Transformers

A new video instance segmentation framework built upon Transformers, termed VisTR, which views the VIS task as a direct end-to-end parallel sequence decoding/prediction problem, and achieves the highest speed among all existing VIS models and the best result among single-model methods on the YouTube-VIS dataset.

STEm-Seg: Spatio-temporal Embeddings for Instance Segmentation in Videos

A novel approach that segments and tracks instances across space and time in a single stage, trained end-to-end to learn spatio-temporal embeddings as well as the parameters required to cluster pixels belonging to a specific object instance over an entire video clip.

Spatial Feature Calibration and Temporal Fusion for Effective One-stage Video Instance Segmentation

This work proposes a simple yet effective one-stage video instance segmentation framework based on spatial calibration and temporal fusion, namely STMask, which helps the framework handle challenging videos with motion blur, partial occlusion, and unusual object-to-camera poses.

Video Instance Segmentation using Inter-Frame Communication Transformers

This work proposes Inter-frame Communication Transformers (IFC), which reduce the overhead of information-passing between frames by efficiently encoding the context within the input clip, using concise memory tokens both to convey information and to summarize each frame's scene.

Video Instance Segmentation with a Propose-Reduce Paradigm

This work proposes a new paradigm, Propose-Reduce, to generate complete sequences for input videos in a single step, and builds a sequence propagation head on an existing image-level instance segmentation network for long-term propagation.

SG-Net: Spatial Granularity Network for One-Stage Video Instance Segmentation

This work proposes a one-stage spatial granularity network (SG-Net) and presents state-of-the-art comparisons on the YouTube-VIS dataset, hoping it can serve as a strong and flexible baseline for the VIS task.

Crossover Learning for Fast Online Video Instance Segmentation

A novel crossover learning scheme that uses the instance feature in the current frame to localize the same instance pixel-wise in other frames, enabling efficient cross-frame instance-to-pixel relation learning and bringing cost-free improvement during inference.

Video Instance Segmentation

This work extends the image instance segmentation problem to the video domain for the first time, and proposes a novel algorithm called MaskTrack R-CNN for this task, which performs simultaneous detection, segmentation, and tracking of instances in videos.

SipMask: Spatial Information Preservation for Fast Image and Video Instance Segmentation

A fast single-stage instance segmentation method that preserves instance-specific spatial information by separating an instance's mask prediction into different sub-regions of its detected bounding box, leading to improved mask predictions; it further introduces a mask alignment weighting loss and a feature alignment scheme to better correlate mask prediction with object detection.