Corpus ID: 237267002

Grid-VLP: Revisiting Grid Features for Vision-Language Pre-training

  Ming Yan, Haiyang Xu, Chenliang Li, Bin Bi, Junfeng Tian, Min Gui, Wei Wang
Existing approaches to vision-language pre-training (VLP) rely heavily on an object detector based on bounding boxes (regions): salient objects are first detected from images, and a Transformer-based model is then used for cross-modal fusion. Despite their strong performance, these approaches are bounded by the capability of the object detector in terms of both effectiveness and efficiency. Moreover, the presence of object detection imposes unnecessary constraints on model designs and …
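The abstract contrasts detector-based region features with grid features taken directly from a CNN feature map. A minimal sketch of the grid-feature idea is below; all shapes, dimensions, and the projection step are illustrative assumptions, not details taken from the paper:

```python
import numpy as np

# Hypothetical shapes: a CNN backbone yields a spatial feature map.
C, H, W = 2048, 7, 7                      # channels, grid height, grid width
rng = np.random.default_rng(0)
feature_map = rng.random((C, H, W))

# Grid-feature tokenization: each of the H*W spatial cells becomes one
# visual token, bypassing any region/bounding-box detector.
grid_tokens = feature_map.reshape(C, H * W).T   # (49, 2048)

# Project visual tokens into the text-embedding space (hypothetical dim).
hidden = 768
W_proj = rng.random((C, hidden))
visual_tokens = grid_tokens @ W_proj            # (49, 768)

# Concatenate with text token embeddings for Transformer-based fusion.
text_tokens = rng.random((16, hidden))          # e.g. 16 word-piece tokens
fusion_input = np.concatenate([text_tokens, visual_tokens], axis=0)
print(fusion_input.shape)  # (65, 768)
```

Because the visual tokens come from a fixed grid rather than detected boxes, the number of visual tokens is constant and no detection pass is needed at inference time.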


Achieving Human Parity on Visual Question Answering
  • Ming Yan, Haiyang Xu, +14 authors Rong Jin
  • Computer Science
  • ArXiv
  • 2021
The Visual Question Answering (VQA) task utilizes both visual image and language analysis to answer a textual question with respect to an image. It has been a popular research topic with an …


LXMERT: Learning Cross-Modality Encoder Representations from Transformers
The LXMERT (Learning Cross-Modality Encoder Representations from Transformers) framework, a large-scale Transformer model that consists of three encoders, achieves state-of-the-art results on two visual question answering datasets and shows the generalizability of the pre-trained cross-modality model.
Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Transformers
Pixel-BERT aligns semantic connections between pixels and text, addressing the limitation of task-specific visual representations for vision-and-language tasks, relieving the cost of bounding-box annotations, and bridging the imbalance between semantic labels in visual tasks and language semantics.
Aggregated Residual Transformations for Deep Neural Networks
On the ImageNet-1K dataset, the authors empirically show that, even under the restricted condition of maintaining complexity, increasing cardinality improves classification accuracy and is more effective than going deeper or wider as model capacity increases.
ERNIE-ViL: Knowledge Enhanced Vision-Language Representations Through Scene Graph
ERNIE-ViL models the joint representation characterizing detailed semantic alignments across vision and language, achieving state-of-the-art performance on five vision-language downstream tasks after fine-tuning.
UNIMO: Towards Unified-Modal Understanding and Generation via Cross-Modal Contrastive Learning
Experimental results show that UNIMO significantly improves performance on several single-modal and multi-modal downstream tasks and can utilize much larger-scale data to learn more generalizable representations.
Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training
  • In Proceedings of the AAAI Conference on Artificial Intelligence
  • 2020
In Defense of Grid Features for Visual Question Answering
This paper revisits grid features for VQA and finds that they can work surprisingly well, running more than an order of magnitude faster with the same accuracy (e.g., when pre-trained in a similar fashion).
VL-BERT: Pre-training of Generic Visual-Linguistic Representations
A new pre-trainable generic representation for visual-linguistic tasks, called Visual-Linguistic BERT (VL-BERT), adopts the simple yet powerful Transformer model as the backbone and extends it to take both visual and linguistic embedded features as input.
Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks
This work introduces a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, enabling nearly cost-free region proposals, and further merges the RPN and Fast R-CNN into a single network by sharing their convolutional features.
Microsoft COCO: Common Objects in Context
We present a new dataset with the goal of advancing the state of the art in object recognition by placing the question of object recognition in the context of the broader question of scene …