Corpus ID: 233296838

TransVG: End-to-End Visual Grounding with Transformers

Jiajun Deng, Zhengyuan Yang, Tianlang Chen, Wengang Zhou, Houqiang Li
In this paper, we present a neat yet effective transformer-based framework for visual grounding, namely TransVG, to address the task of grounding a language query to the corresponding region in an image. The state-of-the-art methods, both two-stage and one-stage, rely on a complex module with manually designed mechanisms to perform query reasoning and multi-modal fusion. However, the involvement of certain mechanisms in fusion module design, such as query decomposition and image… 
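To make the abstract's idea concrete, the following is a minimal, hypothetical sketch of transformer-style multi-modal fusion for visual grounding: visual tokens, text tokens, and a learnable [REG] token are concatenated into one sequence, mixed by a single round of self-attention, and the fused [REG] token is projected to a normalized box. All names, dimensions, and the single-head/no-FFN simplification are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # token embedding size (assumed)

visual_tokens = rng.normal(size=(10, d))  # e.g. flattened CNN feature map
text_tokens = rng.normal(size=(4, d))     # e.g. embedded query words
reg_token = rng.normal(size=(1, d))       # learnable [REG] token (assumed)

# Concatenate both modalities plus the [REG] token into one sequence and
# apply scaled dot-product self-attention (one head, no feed-forward layer).
tokens = np.concatenate([reg_token, visual_tokens, text_tokens], axis=0)
scores = tokens @ tokens.T / np.sqrt(d)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
fused = weights @ tokens

# In a full model the fused [REG] token would be mapped to 4 box
# coordinates by a learned MLP head; a fixed random projection stands in.
W_box = rng.normal(size=(d, 4))
box = 1 / (1 + np.exp(-(fused[0] @ W_box)))  # sigmoid -> normalized (cx, cy, w, h)
print(box.shape)  # (4,)
```

The appeal of this formulation, as the abstract argues, is that attention replaces hand-designed fusion mechanisms: no proposal generation or query decomposition is needed, and box prediction becomes direct regression from a single token.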


Referring Transformer: A One-step Approach to Multi-task Visual Grounding
A simple one-stage multi-task framework for visual grounding tasks using a transformer architecture, where the two modalities are fused in a visual-lingual encoder that outperforms state-of-the-art methods by a large margin on both REC and RES tasks.
Crossing the Format Boundary of Text and Boxes: Towards Unified Vision-Language Modeling
A vision-language (VL) model that unifies text generation and bounding box prediction into a single architecture that achieves comparable performance to task-specific state of the art on 7 VL benchmarks and shows the capability of generalizing to new tasks such as ImageNet object localization.
SAT: 2D Semantics Assisted Training for 3D Visual Grounding
2D Semantics Assisted Training (SAT) is proposed, which utilizes 2D image semantics in the training stage to ease point-cloud-language joint representation learning and assist 3D visual grounding.
SPViT: Enabling Faster Vision Transformers via Soft Token Pruning
A dynamic attention-based multi-head token selector, a lightweight module for adaptive instance-wise token selection, is proposed along with a soft pruning technique that integrates the less informative tokens generated by the selector into a package token that participates in subsequent calculations rather than being completely discarded.
Transformers in Vision: A Survey
This survey aims to provide a comprehensive overview of the Transformer models in the computer vision discipline with an introduction to fundamental concepts behind the success of Transformers i.e., self-attention, large-scale pre-training, and bidirectional feature encoding.


G3raphGround: Graph-Based Language Grounding
The model, called G3raphGround, uses graphs to formulate more complex, non-sequential dependencies among proposal image regions and phrases; it captures intra-modal dependencies using a separate graph neural network for each modality (visual and lingual), and then uses conditional message-passing in another graph neural network to fuse their outputs and capture cross-modal relationships.
Learning to Assemble Neural Module Tree Networks for Visual Grounding
A novel modular network called Neural Module Tree network (NMTree) is developed that regularizes the visual grounding along the dependency parsing tree of the sentence, where each node is a neural module that calculates visual attention according to its linguistic feature, and the grounding score is accumulated in a bottom-up direction as needed.
Real-Time Referring Expression Comprehension by Single-Stage Grounding Network
The proposed Single-Stage Grounding network is time efficient and can ground a referring expression in a 416×416 image from the RefCOCO dataset in 25 ms (40 referents per second) on average on an Nvidia Tesla P40, accomplishing more than 9× speedups over the existing multi-stage models.
A Fast and Accurate One-Stage Approach to Visual Grounding
A simple, fast, and accurate one-stage approach to visual grounding that enables end-to-end joint optimization and shows great potential in terms of both accuracy and speed for both phrase localization and referring expression comprehension.
Rethinking Diversified and Discriminative Proposal Generation for Visual Grounding
This paper introduces diversity and discrimination simultaneously when generating proposals, and in doing so proposes the Diversified and Discriminative Proposal Networks (DDPN) model, a high-performance baseline model for visual grounding, which is evaluated on four benchmark datasets.
Object Relational Graph With Teacher-Recommended Learning for Video Captioning
This paper proposes an object relational graph (ORG) based encoder, which captures more detailed interaction features to enrich visual representation and designs a teacher-recommended learning method to make full use of the successful external language model (ELM) to integrate the abundant linguistic knowledge into the caption model.
ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks
We present ViLBERT (short for Vision-and-Language BERT), a model for learning task-agnostic joint representations of image content and natural language. We extend the popular BERT architecture to a multi-modal two-stream model, processing both visual and textual inputs in separate streams that interact through co-attentional transformer layers.
Grounding Referring Expressions in Images by Variational Context
A variational Bayesian method is proposed to solve the problem of complex context modeling in referring expression grounding by exploiting the reciprocal relation between the referent and context, thereby greatly reducing the search space of context.
Learning to Compose and Reason with Language Tree Structures for Visual Grounding
A natural language grounding model is proposed that can automatically compose a binary tree structure for parsing the language and then perform visual reasoning along the tree in a bottom-up fashion, achieving state-of-the-art performance with more explainable reasoning.
Improving Referring Expression Grounding With Cross-Modal Attention-Guided Erasing
A novel cross-modal attention-guided erasing approach is designed, where the most dominant information from either textual or visual domains is discarded to generate difficult training samples online, and to drive the model to discover complementary textual-visual correspondences.