TransVG: End-to-End Visual Grounding with Transformers

@inproceedings{transvg_iccv2021,
  title     = {TransVG: End-to-End Visual Grounding with Transformers},
  author    = {Jiajun Deng and Zhengyuan Yang and Tianlang Chen and Wengang Zhou and Houqiang Li},
  booktitle = {2021 IEEE/CVF International Conference on Computer Vision (ICCV)},
  year      = {2021}
}
In this paper, we present a neat yet effective transformer-based framework for visual grounding, namely TransVG, to address the task of grounding a language query to the corresponding region in an image. The state-of-the-art methods, whether two-stage or one-stage, rely on a complex module with manually designed mechanisms to perform query reasoning and multi-modal fusion. However, the involvement of certain mechanisms in fusion module design, such as query decomposition and image…
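The core idea in the abstract, fusing visual and text tokens in one transformer and directly regressing box coordinates from a learnable [REG] token, can be sketched in plain Python. This is an illustrative simplification, not the paper's implementation: the mean-pooling stand-in replaces the actual visual-linguistic transformer, and all names and dimensions are hypothetical.

```python
import math
import random

def grounding_forward(visual_tokens, text_tokens, d=8, seed=0):
    """Toy TransVG-style forward pass: prepend a learnable [REG] token to the
    joint visual/text token sequence, fuse, and regress a normalized box."""
    rng = random.Random(seed)
    reg_token = [rng.gauss(0, 0.02) for _ in range(d)]       # learnable [REG] embedding
    tokens = [reg_token] + visual_tokens + text_tokens        # joint multi-modal sequence
    # Stand-in for the visual-linguistic transformer encoder: every output
    # dimension mixes information from all tokens.
    fused = [sum(t[i] for t in tokens) / len(tokens) for i in range(d)]
    # Prediction head: linear layer + sigmoid -> (cx, cy, w, h) in [0, 1].
    w = [[rng.gauss(0, 0.5) for _ in range(d)] for _ in range(4)]
    return [1 / (1 + math.exp(-sum(wi * fi for wi, fi in zip(row, fused))))
            for row in w]
```

The point of the sketch is the token layout: no region proposals and no hand-designed fusion module, just one sequence and a direct coordinate regression from the [REG] output.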


TransVG++: End-to-End Visual Grounding with Language Conditioned Vision Transformer

This work proposes TransVG, which establishes multi-modal correspondences by Transformers and localizes referred regions by directly regressing box coordinates and upgrades the framework to a purely Transformer-based one by leveraging Vision Transformer (ViT) for vision feature encoding.

Referring Transformer: A One-step Approach to Multi-task Visual Grounding

A simple one-stage multi-task framework for visual grounding tasks using a transformer architecture, where the two modalities are fused in a visual-lingual encoder, which outperforms state-of-the-art methods by a large margin on both REC and RES tasks.

Multi-Modal Dynamic Graph Transformer for Visual Grounding

  • Sijia Chen, Baochun Li
  • Computer Science
    2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
  • 2022
Experiments show that with an average of 48 boxes as initialization, M-DGT outperforms existing state-of-the-art methods on the Flickr30k Entities and RefCOCO datasets by a substantial margin, in terms of both accuracy and Intersection over Union scores.

YORO - Lightweight End to End Visual Grounding

YORO is shown to support real-time inference and outperform all approaches in this class (single-stage methods) by large margins and achieves the best speed/accuracy trade-off in the literature.

HOIG: End-to-End Human-Object Interactions Grounding with Transformers

This paper introduces a new task of Human-Object Interactions (HOI) Grounding to localize all the referring human-object pair instances in an image with a given ⟨human, interaction, object⟩ phrase and designs an encoder-decoder architecture to model the task as a set prediction problem based on transformers.

HAM: Hierarchical Attention Model with High Performance for 3D Visual Grounding

A novel Hierarchical Attention Model (HAM) is introduced, offering multi-granularity representation and efficient augmentation for both given texts and multi-modal visual inputs; it ranks first on the large-scale ScanRefer challenge, outperforming all existing methods by a significant margin.

Vision-Language Transformer and Query Generation for Referring Segmentation

Transformer and multi-head attention are introduced and a Query Generation Module is proposed, which produces multiple sets of queries with different attention weights that represent the diversified comprehensions of the language expression from different aspects.

Towards Open-vocabulary Scene Graph Generation with Prompt-based Finetuning

This paper proposes a two-step method that first pre-trains on large amounts of coarse-grained region-caption data and then leverages two prompt-based techniques to finetune the pre-trained model without updating its parameters; it significantly outperforms recent strong SGG methods in the open-vocabulary SGG (Ov-SGG) setting as well as in conventional closed SGG.

Dynamic MDETR: A Dynamic Multimodal Transformer Decoder for Visual Grounding

A new multimodal transformer architecture, coined Dynamic MDETR, is presented, which decouples the whole grounding process into encoding and decoding phases and exploits a sparsity prior to speed up the visual grounding process.

Bear the Query in Mind: Visual Grounding with Query-conditioned Convolution

This work proposes a query-conditioned convolution module (QCM) that extracts query-aware visual features by incorporating query information into the generation of convolutional kernels, achieving state-of-the-art performance on three popular visual grounding datasets.
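The query-conditioned convolution idea, generating the filter from the language query instead of learning one fixed filter, can be illustrated with a toy 1-D version. The slice-and-normalize kernel generator here is a placeholder for the paper's learned projection, and the flattened feature map is an assumption made for brevity.

```python
def query_conditioned_conv(feature_map, query, k=3):
    """Toy query-conditioned convolution: derive a 1-D kernel from the query
    embedding, then convolve it over the (flattened) visual features, so the
    same image features are filtered differently for each expression."""
    # Kernel generator (placeholder): take the first k query dimensions
    # and L1-normalize them; a real module would use a learned projection.
    kernel = query[:k]
    norm = sum(abs(v) for v in kernel) or 1.0
    kernel = [v / norm for v in kernel]
    pad = k // 2
    padded = [0.0] * pad + list(feature_map) + [0.0] * pad
    # Same-length 1-D convolution with zero padding.
    return [sum(kernel[j] * padded[i + j] for j in range(k))
            for i in range(len(feature_map))]
```

With a query whose kernel reduces to [1, 0, 0], the output is simply the left-shifted padded input, which makes the query-dependence of the filtering easy to see.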

G3raphGround: Graph-Based Language Grounding

The model, called GraphGround, uses graphs to formulate more complex, non-sequential dependencies among proposal image regions and phrases; it captures intra-modal dependencies using a separate graph neural network for each modality (visual and lingual), then uses conditional message-passing in another graph neural network to fuse their outputs and capture cross-modal relationships.

Improving One-stage Visual Grounding by Recursive Sub-query Construction

This work proposes a recursive sub-query construction framework that reasons between image and query over multiple rounds, reducing referring ambiguity step by step; its superior performance on longer and more complex queries validates the effectiveness of the query modeling.

A Fast and Accurate One-Stage Approach to Visual Grounding

A simple, fast, and accurate one-stage approach to visual grounding that enables end-to-end joint optimization and shows great potential in terms of both accuracy and speed for both phrase localization and referring expression comprehension.

Learning to Assemble Neural Module Tree Networks for Visual Grounding

A novel modular network called the Neural Module Tree network (NMTree) is developed that regularizes visual grounding along the dependency parsing tree of the sentence, where each node is a neural module that calculates visual attention according to its linguistic feature, and the grounding score is accumulated in a bottom-up direction.

Rethinking Diversified and Discriminative Proposal Generation for Visual Grounding

This paper introduces diversity and discrimination simultaneously when generating proposals, and in doing so proposes the Diversified and Discriminative Proposal Networks model (DDPN), a high-performance baseline model for visual grounding, evaluated on four benchmark datasets.

Real-Time Referring Expression Comprehension by Single-Stage Grounding Network

The proposed Single-Stage Grounding network is time-efficient and can ground a referring expression in a 416×416 image from the RefCOCO dataset in 25 ms (40 referents per second) on average with an Nvidia Tesla P40, achieving more than 9× speedup over existing multi-stage models.

Object Relational Graph With Teacher-Recommended Learning for Video Captioning

This paper proposes an object relational graph (ORG) based encoder, which captures more detailed interaction features to enrich visual representation and designs a teacher-recommended learning method to make full use of the successful external language model (ELM) to integrate the abundant linguistic knowledge into the caption model.

End-to-End Object Detection with Transformers

This work presents a new method that views object detection as a direct set prediction problem, and demonstrates accuracy and run-time performance on par with the well-established and highly-optimized Faster RCNN baseline on the challenging COCO object detection dataset.
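DETR's set-prediction view requires a one-to-one assignment between predicted and ground-truth boxes before computing a loss. A brute-force stand-in for its Hungarian matching, usable only for small box counts and with a simplified (1 − IoU) cost in place of DETR's full matching cost, might look like:

```python
from itertools import permutations

def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def match_predictions(preds, targets):
    """Assign each target to a distinct prediction by exhaustively trying
    every assignment and keeping the one with the lowest total (1 - IoU)
    cost; DETR uses the Hungarian algorithm for the same minimization."""
    best, best_cost = None, float("inf")
    for perm in permutations(range(len(preds)), len(targets)):
        cost = sum(1 - iou(preds[p], targets[t]) for t, p in enumerate(perm))
        if cost < best_cost:
            best, best_cost = perm, cost
    return list(best)  # best[t] = index of the prediction matched to target t
```

The exhaustive search is exponential in the number of targets; it is only meant to show what "direct set prediction" means, namely that losses are computed on a matched pair basis with no duplicate assignments.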

ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks

We present ViLBERT (short for Vision-and-Language BERT), a model for learning task-agnostic joint representations of image content and natural language. We extend the popular BERT architecture to a multi-modal two-stream model, processing both visual and textual inputs in separate streams that interact through co-attentional transformer layers.

Grounding Referring Expressions in Images by Variational Context

A variational Bayesian method to solve the problem of complex context modeling in referring expression grounding by exploiting the reciprocal relation between the referent and context, and thereby the search space of context can be greatly reduced.