Corpus ID: 244714912

End-to-End Referring Video Object Segmentation with Multimodal Transformers

@inproceedings{Botach2021EndtoEndRV,
  title={End-to-End Referring Video Object Segmentation with Multimodal Transformers},
  author={Adam Botach and Evgenii Zheltonozhskii and Chaim Baskin},
  year={2021}
}
The referring video object segmentation task (RVOS) involves segmentation of a text-referred object instance in the frames of a given video. Due to the complex nature of this multimodal task, which combines text reasoning, video understanding, instance segmentation and tracking, existing approaches typically rely on sophisticated pipelines in order to tackle it. In this paper, we propose a simple Transformer-based approach to RVOS. Our framework, termed Multimodal Tracking Transformer (MTTR… 
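
The abstract's central idea, that a single Transformer can process the video and the text jointly and decode per-frame object queries, can be illustrated with a minimal PyTorch sketch. Everything below (the class name, dimensions, the linear "mask head", and the shared queries) is an illustrative assumption, not MTTR's actual architecture.

```python
# Minimal sketch of the single-multimodal-Transformer idea: frame and text
# features are projected into a joint space, processed by one Transformer,
# and per-frame object queries are decoded into mask embeddings.
# All names, sizes, and the mask head are assumptions, not MTTR's code.
import torch
import torch.nn as nn

class MultimodalRVOSSketch(nn.Module):
    def __init__(self, d_model=256, num_queries=50, vis_dim=512, text_dim=768):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, d_model)   # frame tokens -> joint space
        self.txt_proj = nn.Linear(text_dim, d_model)  # word tokens  -> joint space
        self.transformer = nn.Transformer(d_model, nhead=8,
                                          num_encoder_layers=3,
                                          num_decoder_layers=3,
                                          batch_first=True)
        self.queries = nn.Parameter(torch.randn(num_queries, d_model))
        self.mask_head = nn.Linear(d_model, d_model)  # stand-in for a mask decoder

    def forward(self, vis_tokens, txt_tokens):
        # vis_tokens: (B, T, N, vis_dim) flattened per-frame features
        # txt_tokens: (B, L, text_dim) text-encoder outputs
        B, T, N, _ = vis_tokens.shape
        v = self.vis_proj(vis_tokens).flatten(1, 2)   # (B, T*N, d)
        t = self.txt_proj(txt_tokens)                 # (B, L, d)
        memory = torch.cat([v, t], dim=1)             # joint multimodal sequence
        q = self.queries.unsqueeze(0).unsqueeze(0)    # (1, 1, Q, d)
        q = q.expand(B, T, -1, -1).reshape(B, T * len(self.queries), -1)
        hs = self.transformer(src=memory, tgt=q)      # (B, T*Q, d)
        return self.mask_head(hs).view(B, T, -1, hs.size(-1))

model = MultimodalRVOSSketch()
vis = torch.randn(2, 4, 64, 512)  # 2 clips, 4 frames, 64 tokens per frame
txt = torch.randn(2, 12, 768)     # 12 word tokens
out = model(vis, txt)             # (2, 4, 50, 256): one embedding per query per frame
```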

Citations

Language as Queries for Referring Video Object Segmentation
TLDR
This work proposes a simple and unified Transformer-based framework, termed ReferFormer, which outperforms previous methods by a large margin and greatly simplifies the pipeline into an end-to-end framework.
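
The "language as queries" idea reduces to a compact sketch: pool the sentence into one vector and add it to every learned object query, so the decoder attends only to the referred object. The class name, mean-pooling choice, and dimensions are assumptions for illustration, not ReferFormer's implementation.

```python
# Hedged sketch: condition DETR-style object queries on the sentence feature.
import torch
import torch.nn as nn

class LanguageConditionedQueries(nn.Module):
    def __init__(self, d_model=256, num_queries=5, text_dim=768):
        super().__init__()
        self.base_queries = nn.Parameter(torch.randn(num_queries, d_model))
        self.txt_pool = nn.Linear(text_dim, d_model)  # sentence vector -> query space

    def forward(self, txt_tokens):
        # txt_tokens: (B, L, text_dim); mean-pool into a single sentence vector
        sent = self.txt_pool(txt_tokens.mean(dim=1))               # (B, d_model)
        return self.base_queries.unsqueeze(0) + sent.unsqueeze(1)  # (B, Q, d_model)

queries = LanguageConditionedQueries()(torch.randn(2, 12, 768))    # (2, 5, 256)
```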

References

SHOWING 1-10 OF 62 REFERENCES
Cross-Modal Progressive Comprehension for Referring Segmentation
TLDR
A novel Cross-Modal Progressive Comprehension (CMPC) scheme that mimics human behaviors, implemented as a CMPC-I (Image) module and a CMPC-V (Video) module to improve referring image and video segmentation models.
RefVOS: A Closer Look at Referring Expressions for Video Object Segmentation
TLDR
This work argues that existing benchmarks for video object segmentation with referring expressions are composed mainly of trivial cases, in which referents can be identified with simple phrases, and bases its analysis on a new categorization of the phrases in the DAVIS-2017 and Actor-Action datasets into trivial and non-trivial referring expressions.
Rethinking Cross-modal Interaction from a Top-down Perspective for Referring Video Object Segmentation
TLDR
A Transformer-based tracklet-language grounding module is proposed, which models instance-level visual relations and cross-modal interactions simultaneously and efficiently, and ranks first place in the CVPR 2021 Referring YouTube-VOS challenge.
MDETR - Modulated Detection for End-to-End Multi-Modal Understanding
TLDR
This paper proposes MDETR, an end-to-end modulated detector that detects objects in an image conditioned on a raw text query, like a caption or a question, and shows that the pre-training approach provides a way to handle the long tail of object categories which have very few labelled instances.
Visual-Textual Capsule Routing for Text-Based Video Segmentation
TLDR
This work proposes a capsule-based approach that performs pixel-level localization based on a natural language query describing the actor of interest, improving upon the performance of existing state-of-the-art work on single-frame-based localization.
Polar Relative Positional Encoding for Video-Language Segmentation
TLDR
A novel Polar Relative Positional Encoding mechanism is proposed that represents spatial relations in a "linguistic" way, i.e., in terms of direction and range, and is designed as the basic module for vision-language fusion.
Context Modulated Dynamic Networks for Actor and Action Video Segmentation with Language Queries
TLDR
A context-modulated dynamic convolutional operation is proposed, and the resulting framework notably outperforms state-of-the-art methods in actor and action video segmentation with language queries.
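
Language-conditioned dynamic convolution can be sketched in a few lines: the text feature generates a per-sample 1x1 kernel, applied to the visual feature map via a grouped convolution. The function, the linear kernel generator, and all shapes below are illustrative assumptions, not the paper's operation.

```python
# Hedged sketch: per-sample kernels generated from the language feature.
import torch
import torch.nn as nn
import torch.nn.functional as F

def dynamic_conv(vis_feat, txt_feat, kernel_gen):
    # vis_feat: (B, C, H, W) frame features; txt_feat: (B, D) query embedding
    B, C, H, W = vis_feat.shape
    kernels = kernel_gen(txt_feat).view(B, C, 1, 1)  # one 1x1 kernel per sample
    # Grouped-conv trick: fold the batch into channels so each sample is
    # convolved with its own language-generated kernel.
    out = F.conv2d(vis_feat.reshape(1, B * C, H, W), kernels, groups=B)
    return out.view(B, 1, H, W)  # response map localizing the referred actor

kernel_gen = nn.Linear(768, 256)                  # text dim -> channel dim
resp = dynamic_conv(torch.randn(2, 256, 14, 14),  # -> (2, 1, 14, 14)
                    torch.randn(2, 768), kernel_gen)
```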
Collaborative Spatial-Temporal Modeling for Language-Queried Video Actor Segmentation
TLDR
This work proposes a collaborative spatial-temporal encoder-decoder framework that contains a 3D temporal encoder over the video clip to recognize the queried actions, and a 2D spatial encoder over the target frame to accurately segment the queried actors.
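
The collaboration between a clip-level 3D encoder and a frame-level 2D encoder can be illustrated generically; the toy modules below (single conv layers, mean-pooled time, 1x1 fusion) are placeholders for the paper's actual encoders and decoder, not its architecture.

```python
# Hedged sketch: fuse a 3D clip encoder (action) with a 2D frame encoder (actor).
import torch
import torch.nn as nn

class CollabSpatioTemporalSketch(nn.Module):
    def __init__(self, c=64):
        super().__init__()
        self.enc3d = nn.Conv3d(3, c, kernel_size=3, padding=1)  # clip encoder
        self.enc2d = nn.Conv2d(3, c, kernel_size=3, padding=1)  # frame encoder
        self.fuse = nn.Conv2d(2 * c, 1, kernel_size=1)          # joint mask logits

    def forward(self, clip, frame):
        # clip: (B, 3, T, H, W); frame: (B, 3, H, W), the frame to segment
        temporal = self.enc3d(clip).mean(dim=2)  # pool over time -> (B, c, H, W)
        spatial = self.enc2d(frame)              # (B, c, H, W)
        return self.fuse(torch.cat([temporal, spatial], dim=1))  # (B, 1, H, W)

logits = CollabSpatioTemporalSketch()(torch.randn(1, 3, 8, 32, 32),
                                      torch.randn(1, 3, 32, 32))
```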
Referring Segmentation in Images and Videos with Cross-Modal Self-Attention Network
TLDR
A cross-modal self-attention (CMSA) module that utilizes fine details of individual words and the input image or video, effectively capturing the long-range dependencies between linguistic and visual features.
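
The core of cross-modal self-attention reduces to a compact sketch: flatten the feature map into spatial tokens, concatenate the word tokens, and run one self-attention layer over the joint sequence. The function and shapes are illustrative assumptions, not the CMSA paper's exact module.

```python
# Hedged sketch: one self-attention pass over joint visual+linguistic tokens.
import torch
import torch.nn as nn

def cross_modal_self_attention(vis_feat, txt_feat, attn):
    # vis_feat: (B, C, H, W); txt_feat: (B, L, C); attn: nn.MultiheadAttention
    B, C, H, W = vis_feat.shape
    v = vis_feat.flatten(2).transpose(1, 2)   # (B, H*W, C) spatial tokens
    tokens = torch.cat([v, txt_feat], dim=1)  # joint visual+linguistic sequence
    fused, _ = attn(tokens, tokens, tokens)   # long-range cross-modal dependencies
    return fused[:, :H * W].transpose(1, 2).reshape(B, C, H, W)

attn = nn.MultiheadAttention(embed_dim=256, num_heads=8, batch_first=True)
out = cross_modal_self_attention(torch.randn(2, 256, 16, 16),
                                 torch.randn(2, 10, 256), attn)  # (2, 256, 16, 16)
```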
URVOS: Unified Referring Video Object Segmentation Network with a Large-Scale Benchmark
TLDR
The proposed URVOS addresses this challenging problem by performing language-based object segmentation and mask propagation jointly, using a single deep neural network with a proper combination of two attention models.