Hybrid-S2S: Video Object Segmentation with Recurrent Networks and Correspondence Matching

@inproceedings{Azimi2021HybridS2SVO,
  title={Hybrid-S2S: Video Object Segmentation with Recurrent Networks and Correspondence Matching},
  author={Fatemeh Azimi and Stanislav Frolov and Federico Raue and J{\"o}rn Hees and Andreas R. Dengel},
  booktitle={VISIGRAPP},
  year={2021}
}
One-shot Video Object Segmentation~(VOS) is the task of tracking an object of interest through a video sequence at the pixel level, where the segmentation mask of the first frame is given at inference time. In recent years, Recurrent Neural Networks~(RNNs) have been widely used for VOS, but they often suffer from limitations such as drift and error propagation. In this work, we study an RNN-based architecture and address some of these issues by proposing a hybrid sequence-to-sequence architecture…
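The "correspondence matching" component named in the title can be illustrated with a minimal sketch: propagate the first-frame mask to a later frame by matching each target pixel's feature vector to its nearest neighbour among the reference pixels and copying that pixel's label. This is a toy NumPy illustration of the general matching idea, not the paper's actual architecture; the function name, feature shapes, and toy data are all assumptions.

```python
import numpy as np

def propagate_mask(ref_feats, ref_mask, tgt_feats):
    """Propagate a reference-frame mask to a target frame by
    nearest-neighbour matching of per-pixel feature vectors.

    ref_feats, tgt_feats: (H, W, C) feature maps; ref_mask: (H, W) binary.
    (Illustrative only; real VOS models learn these features end-to-end.)
    """
    H, W, C = tgt_feats.shape
    ref = ref_feats.reshape(-1, C)   # (HW_ref, C)
    tgt = tgt_feats.reshape(-1, C)   # (HW_tgt, C)
    # Cosine similarity between every target pixel and every reference pixel.
    ref_n = ref / (np.linalg.norm(ref, axis=1, keepdims=True) + 1e-8)
    tgt_n = tgt / (np.linalg.norm(tgt, axis=1, keepdims=True) + 1e-8)
    sim = tgt_n @ ref_n.T            # (HW_tgt, HW_ref)
    nearest = sim.argmax(axis=1)     # index of best-matching reference pixel
    return ref_mask.reshape(-1)[nearest].reshape(H, W)

# Toy data: two-channel features that cleanly separate object from background.
# In the reference frame the object occupies the top half of a 4x4 grid;
# in the target frame the same features appear in the left half.
ref_feats = np.zeros((4, 4, 2)); ref_feats[:2, :, 0] = 1; ref_feats[2:, :, 1] = 1
ref_mask = np.zeros((4, 4), dtype=int); ref_mask[:2, :] = 1
tgt_feats = np.zeros((4, 4, 2)); tgt_feats[:, :2, 0] = 1; tgt_feats[:, 2:, 1] = 1
print(propagate_mask(ref_feats, ref_mask, tgt_feats))  # mask follows the object to the left half
```

This per-frame matching is memoryless, which is precisely why the abstract pairs it with a recurrent component: matching alone cannot exploit temporal continuity, while an RNN alone is prone to drift.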

