One-stage Video Instance Segmentation: From Frame-in Frame-out to Clip-in Clip-out

@article{Li2022OnestageVI,
  title={One-stage Video Instance Segmentation: From Frame-in Frame-out to Clip-in Clip-out},
  author={Minghan Li and Lei Zhang},
  journal={ArXiv},
  year={2022},
  volume={abs/2203.06421}
}
Many video instance segmentation (VIS) methods partition a video sequence into individual frames and then detect and segment objects frame by frame. However, such a frame-in frame-out (FiFo) pipeline is ineffective at exploiting temporal information. Because adjacent frames in a short clip are highly coherent in content, we propose to extend the one-stage FiFo framework to a clip-in clip-out (CiCo) one, which performs VIS clip by clip. Specifically, we stack FPN features of all frames…
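As a rough illustration of the difference between the two pipelines, the sketch below groups frame indices into processing units. It is not the authors' implementation: the helper name `split_into_clips`, the non-overlapping clips, and the clip length of 3 are all assumptions made for this example.

```python
from typing import List


def split_into_clips(num_frames: int, clip_len: int) -> List[List[int]]:
    """Group frame indices 0..num_frames-1 into consecutive clips.

    Hypothetical helper: non-overlapping clips are an assumption made
    here for simplicity, not something stated in the abstract.
    """
    if clip_len < 1:
        raise ValueError("clip_len must be at least 1")
    return [
        list(range(start, min(start + clip_len, num_frames)))
        for start in range(0, num_frames, clip_len)
    ]


# FiFo: every processing unit is a single frame.
fifo_units = split_into_clips(num_frames=7, clip_len=1)

# CiCo: frames are processed in short clips (3 frames is an arbitrary choice).
cico_units = split_into_clips(num_frames=7, clip_len=3)
```

With `clip_len=1` the same helper reduces to the frame-by-frame case, which is one way to view FiFo as a special case of clip-level processing.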

MDQE: Mining Discriminative Query Embeddings to Segment Occluded Instances on Challenging Videos

This work proposes mining discriminative query embeddings (MDQE) to segment occluded instances in challenging videos, along with an inter-instance mask repulsion loss that distances each instance from its nearby non-target instances.

References

Classifying, Segmenting, and Tracking Object Instances in Video with Mask Propagation

The method, named MaskProp, adapts the popular Mask R-CNN to video by adding a mask propagation branch that propagates frame-level object instance masks from each video frame to all the other frames in a video clip to predict clip-level instance tracks with respect to the object instances segmented in the middle frame of the clip.

STEm-Seg: Spatio-temporal Embeddings for Instance Segmentation in Videos

This work proposes a novel approach that segments and tracks instances across space and time in a single stage. It is trained end-to-end to learn spatio-temporal embeddings as well as the parameters required to cluster pixels belonging to a specific object instance over an entire video clip.

Spatial Feature Calibration and Temporal Fusion for Effective One-stage Video Instance Segmentation

This work proposes a simple yet effective one-stage video instance segmentation framework based on spatial calibration and temporal fusion, namely STMask, which helps the framework handle challenging video conditions such as motion blur, partial occlusion, and unusual object-to-camera poses.

Video Instance Segmentation using Inter-Frame Communication Transformers

This work proposes Inter-frame Communication Transformers (IFC), which reduce the overhead of passing information between frames by efficiently encoding the context within the input clip, using concise memory tokens both to convey information and to summarize each frame scene.

Video Instance Segmentation with a Propose-Reduce Paradigm

This work proposes a new paradigm, Propose-Reduce, that generates complete sequences for input videos in a single step, and builds a sequence propagation head on an existing image-level instance segmentation network for long-term propagation.

SG-Net: Spatial Granularity Network for One-Stage Video Instance Segmentation

This work proposes a one-stage spatial granularity network (SG-Net) and presents state-of-the-art comparisons on the YouTube-VIS dataset, hoping it can serve as a strong and flexible baseline for the VIS task.

CompFeat: Comprehensive Feature Aggregation for Video Instance Segmentation

This work proposes a novel comprehensive feature aggregation approach (CompFeat) to refine features at both frame-level and object-level with temporal and spatial context information to eliminate ambiguities introduced by only using single-frame features.

Crossover Learning for Fast Online Video Instance Segmentation

This work proposes a novel crossover learning scheme that uses the instance feature in the current frame to localize the same instance pixel-wise in other frames, enabling efficient cross-frame instance-to-pixel relation learning and bringing cost-free improvement during inference.

Video Instance Segmentation

This work extends the image instance segmentation problem to the video domain for the first time and proposes a novel algorithm, MaskTrack R-CNN, for the task, which requires simultaneous detection, segmentation, and tracking of instances in videos.

SipMask: Spatial Information Preservation for Fast Image and Video Instance Segmentation

A fast single-stage instance segmentation method that preserves instance-specific spatial information by separating the mask prediction of an instance into different sub-regions of a detected bounding box, leading to improved mask predictions. It also introduces a mask alignment weighting loss and a feature alignment scheme to better correlate mask prediction with object detection.