Multi-Task Collaborative Network for Joint Referring Expression Comprehension and Segmentation

  title={Multi-Task Collaborative Network for Joint Referring Expression Comprehension and Segmentation},
  author={Gen Luo and Yiyi Zhou and Xiaoshuai Sun and Liujuan Cao and Chenglin Wu and Cheng Deng and Rongrong Ji},
  journal={2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  • Gen Luo, Yiyi Zhou, Rongrong Ji
  • Published 19 March 2020
  • Computer Science
Referring expression comprehension (REC) and segmentation (RES) are two highly-related tasks, which both aim at identifying the referent according to a natural language expression. In this paper, we propose a novel Multi-task Collaborative Network (MCN) to achieve a joint learning of REC and RES for the first time. In MCN, RES can help REC to achieve better language-vision alignment, while REC can help RES to better locate the referent. In addition, we address a key challenge in this multi-task… 
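The collaboration described in the abstract, one shared representation trained jointly by a REC (box) loss and a RES (mask) loss, can be sketched in a few lines. Everything below — the head shapes, toy features, and loss weighting — is an illustrative assumption, not the authors' implementation:

```python
# Toy sketch of multi-task collaboration: one shared feature vector
# drives both a REC head (box) and a RES head (mask), and a single
# combined loss couples the two tasks. Purely illustrative.

def rec_head(feat):
    # predict a box (cx, cy, w, h) from the shared feature
    return [f * 0.5 for f in feat[:4]]

def res_head(feat):
    # predict a coarse 2x2 mask from the shared feature
    return [[feat[0], feat[1]], [feat[2], feat[3]]]

def l1(a, b):
    # L1 distance between two equal-length vectors
    return sum(abs(x - y) for x, y in zip(a, b))

def joint_loss(feat, box_gt, mask_gt, lam=1.0):
    # both task losses backpropagate (conceptually) into the same feature,
    # which is the mechanism by which REC and RES can help each other
    loss_rec = l1(rec_head(feat), box_gt)
    loss_res = sum(l1(row, gt) for row, gt in zip(res_head(feat), mask_gt))
    return loss_rec + lam * loss_res

feat = [0.2, 0.4, 0.6, 0.8]
total = joint_loss(feat, [0.1, 0.2, 0.3, 0.4], [[0, 0], [1, 1]])
```

In a real network the two heads would share a backbone and be trained by gradient descent; the sketch only shows how a single scalar objective couples the two tasks.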


Cascade Grouped Attention Network for Referring Expression Segmentation

A Cascade Grouped Attention Network with two innovative designs, Cascade Grouping Attention (CGA) and an Instance-level Attention (ILA) loss, which achieves the high efficiency of one-stage RES while possessing reasoning ability comparable to two-stage methods.

Referring Expression Comprehension via Cross-Level Multi-Modal Fusion

A Cross-level Multi-modal Fusion (CMF) framework is designed, which gradually integrates multi-level visual and textual features through intra- and inter-modal interactions.

What Goes beyond Multi-modal Fusion in One-stage Referring Expression Comprehension: An Empirical Study

The most encouraging finding is that, with much less training overhead and far fewer parameters, SimREC can still outperform a set of large-scale pre-trained models, e.g., UNITER and VILLA, highlighting the special role of REC in existing V&L research.

A Unified Mutual Supervision Framework for Referring Expression Segmentation and Generation

This paper proposes a unified mutual supervision framework that enables two tasks to improve each other and outperforms all existing methods on REG and RES tasks under the same setting.

Referring Expression Comprehension via Enhanced Cross-modal Graph Attention Networks

A novel Enhanced Cross-modal Graph Attention Network (ECMGAN) that strengthens the matching between the expression and entity positions in an image, along with an effective strategy named Graph Node Erase (GNE) that helps ECMGAN eliminate the effect of irrelevant objects on the target object.

Referring Transformer: A One-step Approach to Multi-task Visual Grounding

A simple one-stage multi-task framework for visual grounding that fuses the two modalities in a visual-lingual transformer encoder, outperforming state-of-the-art methods by a large margin on both REC and RES tasks.

Correspondence Matters for Video Referring Expression Comprehension

A novel Dual Correspondence Network (dubbed DCNet) is proposed, which explicitly enhances dense associations in both inter-frame and cross-modal manners and predicts patch-word correspondence through cosine similarity.
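The patch-word correspondence via cosine similarity mentioned above reduces to a similarity matrix between patch embeddings and word embeddings. A minimal sketch, with toy 2-D embeddings assumed for illustration:

```python
import math

def cosine(u, v):
    # cosine similarity between two vectors
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def correspondence(patches, words):
    # patch-word correspondence matrix: entry [i][j] scores how well
    # image patch i matches word j
    return [[cosine(p, w) for w in words] for p in patches]

patches = [[1.0, 0.0], [0.0, 1.0]]   # toy patch embeddings
words = [[1.0, 0.0], [1.0, 1.0]]     # toy word embeddings
M = correspondence(patches, words)
```

In practice the embeddings come from visual and language encoders, and the matrix (here 2x2) serves as a dense alignment signal between the two modalities.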

Self-paced Multi-grained Cross-modal Interaction Modeling for Referring Expression Comprehension

A Self-paced Multi-grained Cross-modal Interaction Modeling framework is proposed, which improves the language-to-vision localization ability through innovations in network structure and learning mechanism and significantly outperforms state-of-the-art methods on widely used datasets.

Towards Language-guided Visual Recognition via Dynamic Convolutions

The first fully language-driven convolution network, termed LaConvNet, which unifies visual recognition and multi-modal reasoning in one forward structure.
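A language-driven convolution, in the spirit of the LaConvNet summary, amounts to predicting convolution kernels from the language embedding instead of learning static ones. A toy 1-D sketch, where the kernel-generator heuristic and all values are hypothetical:

```python
def kernel_from_language(lang_vec, ksize=3):
    # hypothetical kernel generator: derive a 1-D conv kernel from a
    # language embedding (here simply a normalized slice of it)
    raw = lang_vec[:ksize]
    s = sum(abs(x) for x in raw) or 1.0
    return [x / s for x in raw]

def conv1d(signal, kernel):
    # valid-mode 1-D convolution (cross-correlation form)
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

lang = [0.2, 0.6, 0.2, 0.9]          # toy language embedding
feat = [0.0, 1.0, 0.0, 0.0, 1.0]     # toy 1-D visual feature map
resp = conv1d(feat, kernel_from_language(lang))
```

Because the kernel depends on the expression, the same visual input is filtered differently for different sentences, which is the core idea behind dynamic, language-conditioned convolutions.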

Bottom-Up and Bidirectional Alignment for Referring Expression Comprehension

A one-stage approach to referring expression comprehension (REC), which grounds the referent according to a natural language expression, together with a progressive visual attribute decomposing approach that decomposes visual proposals into several independent spaces to enhance the bottom-up alignment framework.

Neighbourhood Watch: Referring Expression Comprehension via Language-Guided Graph Attention Networks

A graph-based, language-guided attention mechanism that represents inter-object relationships and properties with a flexibility and power impossible for competing approaches, and makes the comprehension decision visualizable and explainable.

A Real-time Global Inference Network for One-stage Referring Expression Comprehension

The proposed RealGIN outperforms most existing methods and achieves very competitive performance against the most advanced one, i.e., MAttNet, on the same hardware, while boosting processing speed by 10-20 times over existing methods.

MAttNet: Modular Attention Network for Referring Expression Comprehension

This work proposes to decompose expressions into three modular components related to subject appearance, location, and relationship to other objects, allowing the model to flexibly adapt to expressions containing different types of information in an end-to-end framework.

Cross-Modal Self-Attention Network for Referring Image Segmentation

A cross-modal self-attention (CMSA) module that effectively captures long-range dependencies between linguistic and visual features, and a gated multi-level fusion module to selectively integrate self-attentive cross-modal features corresponding to different levels of the image.
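Cross-modal self-attention in this spirit can be illustrated by concatenating region features and word features into one sequence and letting every element attend to both modalities. A minimal single-head sketch with identity query/key/value projections; all toy values are assumptions:

```python
import math

def softmax(xs):
    # numerically stable softmax over a list of scores
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def self_attention(seq):
    # single-head scaled dot-product self-attention with identity
    # query/key/value projections, for illustration only
    d = len(seq[0])
    out = []
    for q in seq:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in seq]
        w = softmax(scores)
        out.append([sum(wi * v[j] for wi, v in zip(w, seq)) for j in range(d)])
    return out

visual = [[1.0, 0.0], [0.0, 1.0]]       # two image-region features
words = [[0.5, 0.5]]                    # one word embedding
fused = self_attention(visual + words)  # each element attends across both modalities
```

A real CMSA module would add learned projections and multiple heads; the point of the sketch is that attending over the concatenated sequence lets linguistic and visual positions exchange information directly.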

Dynamic Multimodal Instance Segmentation guided by natural language queries

The problem of segmenting an object given a natural language expression that describes it is addressed, and a novel method is proposed that integrates linguistic and visual information along the channel dimension, together with the intermediate information generated when downsampling the image, so that detailed segmentations can be obtained.

Parallel Attention: A Unified Framework for Visual Object Discovery Through Dialogs and Queries

A unified framework, the ParalleL AttentioN (PLAN) network, is proposed to discover the object in an image being referred to by variable-length natural expressions, from short phrase queries to long multi-round dialogs.

Modeling Context in Referring Expressions

This work focuses on incorporating better measures of visual context into referring expression models and finds that visual comparison to other objects within an image helps improve performance significantly.

Grounding Referring Expressions in Images by Variational Context

A variational Bayesian method to solve the problem of complex context modeling in referring expression grounding by exploiting the reciprocal relation between the referent and context, thereby greatly reducing the search space of context.

Referring Expression Generation and Comprehension via Attributes

The role of attributes is explored by incorporating them into both referring expression generation and comprehension: an attribute learning model is trained from visual objects and their paired descriptions, so that expressions are generated driven by both attributes and the previous words.

Rethinking Diversified and Discriminative Proposal Generation for Visual Grounding

This paper introduces diversity and discrimination simultaneously when generating proposals, and in doing so proposes the Diversified and Discriminative Proposal Network (DDPN), a high-performance baseline model for visual grounding, evaluated on four benchmark datasets.