Corpus ID: 232290611

ClawCraneNet: Leveraging Object-level Relation for Text-based Video Segmentation

Chen Liang, Yu Wu, Yawei Luo, Yi Yang
Text-based video segmentation is a challenging task that segments the objects referred to by natural language in videos. It essentially requires semantic comprehension and fine-grained video understanding. Existing methods introduce language representations into segmentation models in a bottom-up manner, which conducts vision-language interaction only within the local receptive fields of ConvNets. We argue that such interaction is insufficient, since the model can barely construct region-level… 

Rethinking Cross-modal Interaction from a Top-down Perspective for Referring Video Object Segmentation

A Transformer-based tracklet-language grounding module is proposed, which models instance-level visual relations and cross-modal interactions simultaneously and efficiently, and ranks 1st place on the CVPR 2021 Referring Youtube-VOS challenge.

Online Video Instance Segmentation via Robust Context Fusion

A robust context fusion network is proposed to tackle VIS in an online fashion, which predicts instance segmentation frame-by-frame with a few preceding frames, and achieves the best performance among existing online VIS methods and is even better than previously published clip-level methods on the Youtube-VIS 2019 and 2021 benchmarks.

SiRi: A Simple Selective Retraining Mechanism for Transformer-based Visual Grounding

This paper proposes a simple yet powerful SiRi mechanism, which can significantly outperform previous approaches on three popular benchmarks, and is further verified on other transformer-based visual grounding models and other vision-language tasks.

R^2VOS: Robust Referring Video Object Segmentation via Relational Multimodal Cycle Consistency

This work poses an extended task from R-VOS without the semantic consensus assumption, named Robust R-VOS (R²-VOS), and embraces the observation that the embedding spaces have relational consistency through the cycle of text-video-text transformation, which connects the primary and dual problems.

Language as Queries for Referring Video Object Segmentation

This work proposes a simple and unified framework built upon Transformer, termed ReferFormer, which views the language as queries and directly attends to the most relevant regions in the video frames, and significantly outperforms the previous methods by a large margin.
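The "language as queries" idea above can be illustrated with a minimal numpy sketch: a sentence-level embedding acts as the attention query over flattened visual features, so the model directly attends to the regions most relevant to the expression. This is a simplified single-query, unprojected version for illustration; ReferFormer itself uses learned object queries conditioned on language inside a full Transformer decoder.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def language_query_attention(lang_query, visual_feats):
    """Cross-attention with the language embedding as the query.

    lang_query:   (d,)   sentence-level language embedding (the "query")
    visual_feats: (n, d) flattened per-location visual features (keys/values)
    Returns the attention weights over the n locations and the attended feature.
    """
    d = lang_query.shape[-1]
    scores = visual_feats @ lang_query / np.sqrt(d)  # (n,) relevance per location
    attn = softmax(scores)                           # normalized attention map
    attended = attn @ visual_feats                   # (d,) language-conditioned summary
    return attn, attended
```

The attention map itself already highlights which spatial locations match the expression, which is why query-based designs can drop explicit pixel-word fusion stages.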

MaIL: A Unified Mask-Image-Language Trimodal Network for Referring Image Segmentation

MaIL is proposed, which is a more concise encoder-decoder pipeline with a Mask-Image-Language trimodal encoder that unifies uni-modal feature extractors and their fusion model into a deep modality interaction encoder, facilitating sufficient feature interaction across different modalities.

Referring Expression Object Segmentation with Caption-Aware Consistency

This work proposes an end-to-end trainable comprehension network that consists of language and visual encoders to extract feature representations from both domains, and introduces spatial-aware dynamic filters to transfer knowledge from text to image and effectively capture the spatial information of the specified object.
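A dynamic filter of this kind can be sketched in its simplest 1×1 form: a per-channel filter predicted from the text is dotted with the visual feature map at every location, producing a text-conditioned response map. This is a hypothetical minimal case; the actual paper predicts spatial-aware filters with a learned network, not a bare dot product.

```python
import numpy as np

def apply_dynamic_filter(visual, lang_filter):
    """1x1 dynamic convolution: a channel-wise filter predicted from text
    is applied to the visual features via a dot product at each location.

    visual:      (C, H, W) visual feature map
    lang_filter: (C,)      filter weights derived from the language encoder
    Returns an (H, W) response map, high where features match the text.
    """
    # Contract over the channel axis: response[h, w] = sum_c filter[c] * visual[c, h, w]
    return np.tensordot(lang_filter, visual, axes=([0], [0]))
```

Because the filter weights change with each expression, the same visual backbone can highlight different objects for different sentences.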

Referring Image Segmentation via Cross-Modal Progressive Comprehension

This paper proposes a Cross-Modal Progressive Comprehension (CMPC) module and a Text-Guided Feature Exchange (TGFE) module to effectively address the challenging task of referring image segmentation.

Dynamic Multimodal Instance Segmentation guided by natural language queries

This work addresses the problem of segmenting an object given a natural language expression that describes it, and proposes a novel method that integrates linguistic and visual information along the channel dimension, together with the intermediate features generated while downsampling the image, so that detailed segmentations can be obtained.
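Channel-dimension fusion of the kind described above is commonly implemented by tiling the language vector over the spatial grid and concatenating it with the visual channels; a minimal sketch with hypothetical shapes:

```python
import numpy as np

def channel_concat_fusion(visual, lang):
    """Fuse a language vector with a visual feature map along channels.

    visual: (C, H, W) visual feature map
    lang:   (L,)      sentence embedding
    Returns (C + L, H, W): the language vector tiled to every spatial
    position and concatenated on the channel axis.
    """
    C, H, W = visual.shape
    lang_map = np.broadcast_to(lang[:, None, None], (lang.shape[0], H, W))
    return np.concatenate([visual, lang_map], axis=0)
```

Downstream convolutions over the concatenated tensor can then mix linguistic and visual evidence at every location.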

Polar Relative Positional Encoding for Video-Language Segmentation

A novel Polar Relative Positional Encoding mechanism that represents spatial relations in a “linguistic” way, i.e., in terms of direction and range, is proposed and designed as the basic module for vision-language fusion.
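Representing spatial relations by direction and range, as the Polar RPE does, can be sketched by converting each pixel's offset from a reference point into polar coordinates. This is an illustrative construction, not the paper's exact encoding, which learns embeddings over discretized direction/range bins:

```python
import numpy as np

def polar_relative_encoding(h, w, ref_y, ref_x):
    """Encode every pixel's offset from (ref_y, ref_x) as (direction, range).

    Returns a (2, h, w) map: channel 0 is the angle in [-pi, pi]
    (the "linguistic" direction, e.g. left/above), channel 1 is the
    distance normalized to [0, 1] (the range, e.g. near/far).
    """
    ys, xs = np.mgrid[0:h, 0:w]
    dy, dx = ys - ref_y, xs - ref_x
    angle = np.arctan2(dy, dx)                  # quadrant-aware direction
    radius = np.hypot(dy, dx)
    if radius.max() > 0:
        radius = radius / radius.max()          # normalized range
    return np.stack([angle, radius], axis=0)
```

Such an encoding lets the fusion module ground phrases like "the person on the left" through explicit direction channels rather than raw (x, y) offsets.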

Video Captioning With Attention-Based LSTM and Semantic Consistency

A novel end-to-end framework named aLSTMs, an attention-based LSTM model with semantic consistency, translates videos into natural sentences, achieving competitive or even better results than state-of-the-art baselines for video captioning in both BLEU and METEOR.

Linguistic Structure Guided Context Modeling for Referring Image Segmentation

This work proposes a "gather-propagate-distribute" scheme to model multimodal context via cross-modal interaction, and implements the scheme as a novel Linguistic Structure guided Context Modeling (LSCM) module.

Segmentation from Natural Language Expressions

An end-to-end trainable recurrent and convolutional network model that jointly learns to process visual and linguistic information is proposed; it can produce quality segmentation output from natural language expressions and outperforms baseline methods by a large margin.

Cross-Modal Self-Attention Network for Referring Image Segmentation

This paper proposes a cross-modal self-attention (CMSA) module that effectively captures the long-range dependencies between linguistic and visual features, and a gated multi-level fusion module to selectively integrate self-attentive cross-modal features corresponding to different levels in the image.
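The core of cross-modal self-attention can be sketched as plain self-attention over the concatenation of visual and word features, so every pixel can attend to every word (and vice versa) regardless of distance. This is a deliberately stripped-down single-head version without the learned query/key/value projections the CMSA module uses:

```python
import numpy as np

def _softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_self_attention(visual, lang):
    """Joint self-attention over concatenated visual and word features.

    visual: (n, d) flattened per-pixel features
    lang:   (m, d) per-word features (projected to the same dimension d)
    Returns (n, d): visual features refined with long-range cross-modal context.
    """
    X = np.concatenate([visual, lang], axis=0)   # (n + m, d) joint sequence
    d = X.shape[-1]
    attn = _softmax(X @ X.T / np.sqrt(d))        # (n+m, n+m) pairwise weights
    fused = attn @ X                             # contextualized features
    return fused[: visual.shape[0]]              # keep the visual positions
```

Because the attention matrix spans both modalities, a pixel far from the referred object can still pick up evidence from the relevant words, which local convolutional fusion cannot do.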

Image Captioning with Semantic Attention

This paper proposes a new algorithm that combines top-down and bottom-up approaches to natural language description through a model of semantic attention, and significantly outperforms the state-of-the-art approaches consistently across different evaluation metrics.

See-Through-Text Grouping for Referring Image Segmentation

The proposed method is driven by a convolutional-recurrent neural network (ConvRNN) that iteratively carries out top-down processing of bottom-up segmentation cues and derives a See-through-Text Embedding Pixelwise (STEP) heatmap, which reveals segmentation cues at the pixel level via the learned visual-textual co-embedding.