Corpus ID: 233296838

TransVG: End-to-End Visual Grounding with Transformers

@article{Deng2021TransVGEV,
  title={TransVG: End-to-End Visual Grounding with Transformers},
  author={Jiajun Deng and Zhengyuan Yang and Tianlang Chen and Wen-gang Zhou and Houqiang Li},
  journal={ArXiv},
  year={2021},
  volume={abs/2104.08541}
}
In this paper, we present a neat yet effective transformer-based framework for visual grounding, namely TransVG, to address the task of grounding a language query to the corresponding region of an image. The state-of-the-art methods, including two-stage and one-stage ones, rely on a complex module with manually designed mechanisms to perform the query reasoning and multi-modal fusion. However, the involvement of certain mechanisms in fusion module design, such as query decomposition and image…
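The abstract's core idea is to fuse visual and linguistic tokens with a plain transformer and regress the box directly from a learnable [REG] token, instead of ranking proposals or hand-crafting a fusion module. A minimal sketch of that pattern follows; the module names, dimensions, and the use of a stock PyTorch nn.TransformerEncoder as the fusion stage are illustrative assumptions, not the authors' released code.

# Minimal sketch of transformer-based visual grounding in the spirit of TransVG.
# Assumption: visual and text features arrive as pre-computed token sequences;
# a stock nn.TransformerEncoder stands in for the visual-linguistic fusion module.
import torch
import torch.nn as nn


class GroundingFusion(nn.Module):
    def __init__(self, d_model: int = 256, nhead: int = 8, num_layers: int = 6):
        super().__init__()
        # Learnable [REG] token whose output state is decoded into a box.
        self.reg_token = nn.Parameter(torch.zeros(1, 1, d_model))
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=nhead, batch_first=True
        )
        self.fusion = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        # Regression head: 4 normalized box coordinates (cx, cy, w, h).
        self.bbox_head = nn.Sequential(
            nn.Linear(d_model, d_model), nn.ReLU(), nn.Linear(d_model, 4)
        )

    def forward(self, visual_tokens: torch.Tensor, text_tokens: torch.Tensor):
        # visual_tokens: (B, Nv, D) flattened image features from a visual backbone
        # text_tokens:   (B, Nt, D) query embeddings from a language encoder
        b = visual_tokens.size(0)
        reg = self.reg_token.expand(b, -1, -1)
        tokens = torch.cat([reg, visual_tokens, text_tokens], dim=1)
        fused = self.fusion(tokens)
        # The [REG] slot directly predicts the referred box; no proposals needed.
        return self.bbox_head(fused[:, 0]).sigmoid()


if __name__ == "__main__":
    model = GroundingFusion()
    box = model(torch.randn(2, 400, 256), torch.randn(2, 20, 256))
    print(box.shape)  # torch.Size([2, 4])

Feeding pre-extracted token sequences keeps the sketch self-contained; in the full framework they would come from a visual backbone and a language encoder, and direct regression from the [REG] token removes the need for proposal generation and ranking.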

Citations

Referring Transformer: A One-step Approach to Multi-task Visual Grounding
TLDR
A simple one-stage multi-task framework for visual grounding tasks using a transformer architecture, where the two modalities are fused in a visual-lingual encoder, which outperforms state-of-the-art methods by a large margin on both REC and RES tasks.
Crossing the Format Boundary of Text and Boxes: Towards Unified Vision-Language Modeling
TLDR
A vision-language (VL) model that unifies text generation and bounding box prediction into a single architecture, achieving comparable performance to task-specific state of the art on 7 VL benchmarks and showing the capability of generalizing to new tasks such as ImageNet object localization.
SAT: 2D Semantics Assisted Training for 3D Visual Grounding
TLDR
2D Semantics Assisted Training (SAT) is proposed, which utilizes 2D image semantics in the training stage to ease point-cloud-language joint representation learning and assist 3D visual grounding.
Transformers in Vision: A Survey
TLDR
This survey aims to provide a comprehensive overview of Transformer models in the computer vision discipline, with an introduction to the fundamental concepts behind the success of Transformers, i.e., self-attention, large-scale pre-training, and bidirectional feature encoding.

References

Showing 1–10 of 67 references
G3raphGround: Graph-Based Language Grounding
TLDR
The model, which is called GraphGround, uses graphs to formulate more complex, non-sequential dependencies among proposal image regions and phrases; it captures intra-modal dependencies using a separate graph neural network for each modality (visual and lingual), and then uses conditional message-passing in another graph neural network to fuse their outputs and capture cross-modal relationships.
Learning to Assemble Neural Module Tree Networks for Visual Grounding
TLDR
A novel modular network called Neural Module Tree network (NMTree) is developed that regularizes visual grounding along the dependency parsing tree of the sentence, where each node is a neural module that calculates visual attention according to its linguistic feature, and the grounding score is accumulated in a bottom-up direction.
Real-Time Referring Expression Comprehension by Single-Stage Grounding Network
TLDR
The proposed Single-Stage Grounding network is time efficient and can ground a referring expression in a 416×416 image from the RefCOCO dataset in 25 ms (40 referents per second) on average with an Nvidia Tesla P40, accomplishing more than 9× speedup over the existing multi-stage models.
A Fast and Accurate One-Stage Approach to Visual Grounding
TLDR
A simple, fast, and accurate one-stage approach to visual grounding that enables end-to-end joint optimization and shows great potential in terms of both accuracy and speed for both phrase localization and referring expression comprehension.
Rethinking Diversified and Discriminative Proposal Generation for Visual Grounding
TLDR
This paper introduces diversity and discrimination simultaneously when generating proposals, and in doing so proposes the Diversified and Discriminative Proposal Networks (DDPN) model, a high-performance baseline for visual grounding that is evaluated on four benchmark datasets.
Object Relational Graph With Teacher-Recommended Learning for Video Captioning
TLDR
This paper proposes an object relational graph (ORG) based encoder, which captures more detailed interaction features to enrich visual representation and designs a teacher-recommended learning method to make full use of the successful external language model (ELM) to integrate the abundant linguistic knowledge into the caption model.
ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks
We present ViLBERT (short for Vision-and-Language BERT), a model for learning task-agnostic joint representations of image content and natural language. We extend the popular BERT architecture to a…
Grounding Referring Expressions in Images by Variational Context
TLDR
A variational Bayesian method to solve the problem of complex context modeling in referring expression grounding by exploiting the reciprocal relation between the referent and context, thereby greatly reducing the search space of context.
Learning to Compose and Reason with Language Tree Structures for Visual Grounding
TLDR
A natural language grounding model is proposed that can automatically compose a binary tree structure for parsing the language and then perform visual reasoning along the tree in a bottom-up fashion, achieving state-of-the-art performance with more explainable reasoning.
Improving Referring Expression Grounding With Cross-Modal Attention-Guided Erasing
TLDR
A novel cross-modal attention-guided erasing approach is designed, where the most dominant information from either the textual or visual domain is discarded to generate difficult training samples online and drive the model to discover complementary textual-visual correspondences.