TransVG: End-to-End Visual Grounding with Transformers

@article{Deng2021TransVGEV,
  title={TransVG: End-to-End Visual Grounding with Transformers},
  author={Jiajun Deng and Zhengyuan Yang and Tianlang Chen and Wen-gang Zhou and Houqiang Li},
  journal={2021 IEEE/CVF International Conference on Computer Vision (ICCV)},
  year={2021},
  pages={1749-1759}
}
In this paper, we present a neat yet effective transformer-based framework for visual grounding, namely TransVG, to address the task of grounding a language query to the corresponding region of an image. State-of-the-art methods, whether two-stage or one-stage, rely on a complex module with manually designed mechanisms to perform query reasoning and multi-modal fusion. However, the involvement of certain mechanisms in fusion module design, such as query decomposition and image…
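
As a rough illustration of the homogeneous fusion that TransVG advocates, the sketch below stacks flattened visual tokens, linguistic tokens, and a learnable [REG] token, runs them through a plain transformer encoder, and regresses the box from the [REG] output. This is a minimal, assumption-laden simplification (the token dimensions, the nn.TransformerEncoder stack, and the MLP box head are placeholders chosen for brevity), not the authors' released implementation.

import torch
import torch.nn as nn

class SimpleTransVGFusion(nn.Module):
    # Hypothetical, simplified stand-in for a TransVG-style visual-linguistic transformer.
    def __init__(self, d_model=256, nhead=8, num_layers=6):
        super().__init__()
        # Learnable [REG] token whose output embedding is used for box regression.
        self.reg_token = nn.Parameter(torch.zeros(1, 1, d_model))
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        # MLP head predicting a normalized box (cx, cy, w, h).
        self.box_head = nn.Sequential(
            nn.Linear(d_model, d_model), nn.ReLU(),
            nn.Linear(d_model, 4), nn.Sigmoid(),
        )

    def forward(self, visual_tokens, text_tokens):
        # visual_tokens: (B, Nv, d_model) flattened image features from a visual backbone.
        # text_tokens:   (B, Nt, d_model) projected language features from a text encoder.
        reg = self.reg_token.expand(visual_tokens.size(0), -1, -1)
        fused = self.encoder(torch.cat([reg, visual_tokens, text_tokens], dim=1))
        return self.box_head(fused[:, 0])  # regress the box from the [REG] position

# Toy usage with random tensors standing in for backbone outputs.
model = SimpleTransVGFusion()
boxes = model(torch.randn(2, 400, 256), torch.randn(2, 20, 256))
print(boxes.shape)  # torch.Size([2, 4])

The point of the sketch is only that fusion reduces to standard self-attention over a single token sequence rather than a manually designed fusion module; in the paper the visual and linguistic features come from dedicated DETR-style and BERT-based branches, respectively.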

Referring Transformer: A One-step Approach to Multi-task Visual Grounding
TLDR
A simple one-stage multi-task framework for visual grounding tasks using a transformer architecture, in which the two modalities are fused in a visual-lingual encoder; it outperforms state-of-the-art methods by a large margin on both REC and RES tasks.
Shifting More Attention to Visual Backbone: Query-modulated Refinement Networks for End-to-End Visual Grounding
TLDR
A Query-modulated Refinement Network (QRNet) is proposed to address the inconsistency issue by adjusting intermediate features in the visual backbone with a novel Query-aware Dynamic Attention (QD-ATT) mechanism and query-aware multiscale fusion.
SeqTR: A Simple yet Universal Network for Visual Grounding
TLDR
Experiments on five benchmark datasets demonstrate that the proposed SeqTR outperforms (or is on par with) the existing state of the art, proving that a simple yet universal approach to visual grounding is indeed feasible.
A Survey of Visual Transformers
TLDR
This survey comprehensively reviews over one hundred different visual Transformers according to three fundamental CV tasks and different data-stream types, including the deformable attention module, which combines the best of the sparse spatial sampling of deformable convolution and the relation-modeling capability of Transformers.
Crossing the Format Boundary of Text and Boxes: Towards Unified Vision-Language Modeling
TLDR
A vision-language (VL) model that unifies text generation and bounding box prediction into a single architecture that achieves comparable performance to task-specific state of the art on 7 VL benchmarks and shows the capability of generalizing to new tasks such as ImageNet object localization.
Referring Expression Comprehension via Cross-Level Multi-Modal Fusion
TLDR
A Cross-level Multi-modal Fusion (CMF) framework is designed, which gradually integrates multi-layer visual and textual features through intra- and inter-modal fusion.
TubeDETR: Spatio-Temporal Video Grounding with Transformers
TLDR
TubeDETR is proposed, a transformer-based architecture inspired by the recent success of such models for text-conditioned object detection that includes an efficient video and text encoder that models spatial multi-modal interactions over sparsely sampled frames and a space-time decoder that jointly performs spatio-temporal localization.
SAT: 2D Semantics Assisted Training for 3D Visual Grounding
TLDR
2D Semantics Assisted Training (SAT) is proposed, which utilizes 2D image semantics in the training stage to ease point-cloud-language joint representation learning and assist 3D visual grounding.
FindIt: Generalized Localization with Natural Language Queries
TLDR
This work proposes FindIt, a simple and versatile framework that unifies a variety of visual grounding and localization tasks including referring expression comprehension, text-based localization, and object detection, and discovers that a standard object detector is surprisingly effective in unifying these tasks without a need for task-specific design, losses, or pre-computed detections.
Local-Global Context Aware Transformer for Language-Guided Video Segmentation
TLDR
LOCATER (local-global context aware Transformer) augments the Transformer architecture with a finite memory so as to query the entire video with the language expression in an efficient manner, and outperforms previous state-of-the-art solutions.

References

Showing 1-10 of 67 references
G3raphGround: Graph-Based Language Grounding
TLDR
The model, which is called GraphGround, uses graphs to formulate more complex, non-sequential dependencies among proposal image regions and phrases; it captures intra-modal dependencies using a separate graph neural network for each modality (visual and lingual), and then uses conditional message-passing in another graph neural network to fuse their outputs and capture cross-modal relationships.
A Fast and Accurate One-Stage Approach to Visual Grounding
TLDR
A simple, fast, and accurate one-stage approach to visual grounding that enables end-to-end joint optimization and shows great potential in terms of both accuracy and speed for both phrase localization and referring expression comprehension.
Learning to Assemble Neural Module Tree Networks for Visual Grounding
TLDR
A novel modular network called Neural Module Tree network (NMTree) is developed that regularizes the visual grounding along the dependency parsing tree of the sentence, where each node is a neural module that calculates visual attention according to its linguistic feature, and the grounding score is accumulated in a bottom-up direction as needed.
Rethinking Diversified and Discriminative Proposal Generation for Visual Grounding
TLDR
This paper introduces diversity and discrimination simultaneously when generating proposals, and in doing so proposes the Diversified and Discriminative Proposal Networks (DDPN) model, a high-performance baseline for visual grounding, which is evaluated on four benchmark datasets.
Real-Time Referring Expression Comprehension by Single-Stage Grounding Network
TLDR
The proposed Single-Stage Grounding network is time-efficient and can ground a referring expression in a 416×416 image from the RefCOCO dataset in 25 ms (40 referents per second) on average on an Nvidia Tesla P40, accomplishing a more than 9× speedup over the existing multi-stage models.
Object Relational Graph With Teacher-Recommended Learning for Video Captioning
TLDR
This paper proposes an object relational graph (ORG) based encoder, which captures more detailed interaction features to enrich visual representation and designs a teacher-recommended learning method to make full use of the successful external language model (ELM) to integrate the abundant linguistic knowledge into the caption model.
ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks
We present ViLBERT (short for Vision-and-Language BERT), a model for learning task-agnostic joint representations of image content and natural language. We extend the popular BERT architecture to a multi-modal two-stream model, processing both visual and textual inputs in separate streams that interact through co-attentional transformer layers.
Grounding Referring Expressions in Images by Variational Context
TLDR
A variational Bayesian method to solve the problem of complex context modeling in referring expression grounding by exploiting the reciprocal relation between the referent and context, thereby greatly reducing the search space of the context.
Learning to Compose and Reason with Language Tree Structures for Visual Grounding
TLDR
A natural language grounding model that can automatically compose a binary tree structure for parsing the language and then perform visual reasoning along the tree in a bottom-up fashion and achieves the state-of-the-art performance with more explainable reasoning is proposed.
Zero-Shot Grounding of Objects From Natural Language Queries
TLDR
A new single-stage model called ZSGNet is proposed which combines the detector network and the grounding system and predicts classification scores and regression parameters and achieves state-of-the-art performance on Flickr30k and ReferIt under the usual “seen” settings and performs significantly better than baseline in the zero-shot setting.