UniTAB: Unifying Text and Box Outputs for Grounded Vision-Language Modeling
@inproceedings{Yang2021UniTABUT,
  title     = {UniTAB: Unifying Text and Box Outputs for Grounded Vision-Language Modeling},
  author    = {Zhengyuan Yang and Zhe Gan and Jianfeng Wang and Xiaowei Hu and Faisal Ahmed and Zicheng Liu and Yumao Lu and Lijuan Wang},
  booktitle = {European Conference on Computer Vision},
  year      = {2022}
}
We propose UniTAB, which Unifies Text And Box outputs for grounded vision-language (VL) modeling. Grounded VL tasks such as grounded captioning require the model to generate a text description and align predicted words with object regions. To achieve this, models must generate the desired text and box outputs together while also indicating the alignments between words and boxes. In contrast to existing solutions that use multiple separate modules for different outputs, UniTAB represents both text…
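The core idea in the abstract (emitting boxes as discrete tokens in the same output sequence as words) can be illustrated with a short sketch. This is a minimal illustration, not the paper's implementation: the bin count, the <bin_*> token names, and the <obj>/</obj> delimiters below are assumptions.

```python
# Minimal sketch of the "text and boxes in one token sequence" idea.
# num_bins, the <bin_*> token names, and the <obj>/</obj> delimiters
# are illustrative assumptions, not UniTAB's exact vocabulary.

def box_to_tokens(box, image_w, image_h, num_bins=1000):
    """Quantize a pixel-space box (x1, y1, x2, y2) into discrete tokens
    that the same decoder that produces words can emit."""
    x1, y1, x2, y2 = box
    coords = (x1 / image_w, y1 / image_h, x2 / image_w, y2 / image_h)
    return [f"<bin_{min(int(c * num_bins), num_bins - 1)}>" for c in coords]

def grounded_caption_tokens(words, grounded_boxes, image_w, image_h):
    """Interleave caption words with box tokens: a grounded word is
    wrapped as <obj> word <bin_..> x4 </obj> inside one flat sequence."""
    out = []
    for i, word in enumerate(words):
        if i in grounded_boxes:
            out += ["<obj>", word]
            out += box_to_tokens(grounded_boxes[i], image_w, image_h)
            out.append("</obj>")
        else:
            out.append(word)
    return out

# Example: "a dog on the grass" with "dog" grounded to a box.
print(grounded_caption_tokens(
    ["a", "dog", "on", "the", "grass"],
    {1: (120, 80, 360, 400)}, image_w=640, image_h=480))
```

A single autoregressive decoder trained on such sequences makes word-box alignment fall out of the token order itself, rather than requiring a separate grounding head.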
8 Citations
PEVL: Position-enhanced Pre-training and Prompt Tuning for Vision-language Models
- Computer Science · ArXiv
- 2022
PEVL reformulates discretized object positions and language in a unified language modeling framework, which facilitates explicit VL alignment during pre-training, and also enables flexible prompt tuning for various downstream tasks.
VoLTA: Vision-Language Transformer with Weakly-Supervised Local-Feature Alignment
- Computer Science · ArXiv
- 2022
This work proposes VoLTA (Vision-Language Transformer with weakly-supervised local-feature Alignment), a new VLP paradigm that only utilizes image-caption data but achieves fine-grained region-level image understanding, eliminating the use of expensive box annotations.
Generalized Decoding for Pixel, Image, and Language
- Computer Science · ArXiv
- 2022
X-Decoder is presented, a generalized decoding model that can predict pixel-level segmentation and language tokens seamlessly, decoding different pixel-level and token-level outputs in the same semantic space.
ReCo: Region-Controlled Text-to-Image Generation
- Computer Science · ArXiv
- 2022
The proposed model, dubbed ReCo (Region-Controlled T2I), enables region control for arbitrary objects described by open-ended regional text rather than by object labels from a constrained category set, and can better control object count, spatial relationships, and region attributes such as color and size through the free-form regional description.
Uni-Perceiver v2: A Generalist Model for Large-Scale Vision and Vision-Language Tasks
- Computer Science · ArXiv
- 2022
This paper proposes Uni-Perceiver v2, the first generalist model capable of handling major large-scale vision and vision-language tasks with competitive performance, and introduces an improved optimizer to ensure stable multi-task learning with an unmixed sampling strategy, which is helpful for tasks requiring large-batch training.
PromptCap: Prompt-Guided Task-Aware Image Captioning
- Computer Science · ArXiv
- 2022
Image captioning aims to describe an image with a natural language sentence, allowing powerful language models to understand images. The framework of combining image captioning with language models…
Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks
- Computer Science · ArXiv
- 2022
UNIFIED-IO is the first model capable of performing all 7 tasks on the GRIT benchmark and produces strong results across 16 diverse benchmarks like NYUv2-Depth, ImageNet, VQA2.0, OK-VQA, Swig, VizWizGround, BoolQ, and SciTail, with no task-specific fine-tuning.
I-Tuning: Tuning Frozen Language Models with Image for Lightweight Image Captioning
- Computer Science
- 2022
I-Tuning is a lightweight image captioning framework that contains only a small number of trainable parameters and connects the frozen pre-trained language decoder GPT-2 with the vision encoder CLIP-ViT.
References
Showing 1-10 of 86 references
Pix2seq: A Language Modeling Framework for Object Detection
- Computer Science · ICLR
- 2022
Pix2seq is presented, a simple and generic framework that casts object detection as language modeling and achieves competitive results on the challenging COCO dataset compared to highly specialized and well-optimized detection algorithms.
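Since this reference frames detection itself as token generation (the same framing UniTAB adopts for its box outputs), a hedged sketch may help. The [ymin, xmin, ymax, xmax, class] token order matches the Pix2seq paper's description, while the bin count and the vocabulary layout here are assumptions for illustration.

```python
# Sketch of Pix2seq-style detection-as-sequence-generation.
# num_bins=2000 and placing class ids above the coordinate bins are
# assumptions for illustration, not a verified reference implementation.

def detection_to_sequence(objects, image_size, num_bins=2000):
    """Serialize [((ymin, xmin, ymax, xmax), class_id), ...] into one
    flat sequence of quantized coordinate tokens plus a class token."""
    seq = []
    for (ymin, xmin, ymax, xmax), class_id in objects:
        for coord in (ymin, xmin, ymax, xmax):
            seq.append(min(int(coord / image_size * num_bins), num_bins - 1))
        seq.append(num_bins + class_id)  # class ids sit above the coord bins
    return seq

# Two boxes on a 640x640 image -> ten tokens (five per object).
print(detection_to_sequence(
    [((80, 120, 400, 360), 3), ((10, 20, 200, 220), 7)], image_size=640))
```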
Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning
- Computer Science, Environmental Science · ACL
- 2018
We present a new dataset of image caption annotations, Conceptual Captions, which contains an order of magnitude more images than the MS-COCO dataset (Lin et al., 2014) and represents a wider variety…
Self-Critical Sequence Training for Image Captioning
- Computer Science · 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
- 2017
This paper considers the problem of optimizing image captioning systems using reinforcement learning, and shows that by carefully optimizing systems using the test metrics of the MSCOCO task, significant gains in performance can be realized.
VQA: Visual Question Answering
- Computer Science · 2015 IEEE International Conference on Computer Vision (ICCV)
- 2015
We propose the task of free-form and open-ended Visual Question Answering (VQA). Given an image and a natural language question about the image, the task is to provide an accurate natural language…
Microsoft COCO Captions: Data Collection and Evaluation Server
- Computer Science · ArXiv
- 2015
The Microsoft COCO Caption dataset and evaluation server are described, and several popular metrics, including BLEU, METEOR, ROUGE, and CIDEr, are used to score candidate captions.
Microsoft COCO: Common Objects in Context
- Computer Science · ECCV
- 2014
We present a new dataset with the goal of advancing the state-of-the-art in object recognition by placing the question of object recognition in the context of the broader question of scene…
CIDEr: Consensus-based image description evaluation
- Computer Science · 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
- 2015
A novel paradigm for evaluating image descriptions based on human consensus is proposed, along with a new automated metric that captures human judgment of consensus better than existing metrics across sentences generated by various sources.
Im2Text: Describing Images Using 1 Million Captioned Photographs
- Computer Science · NIPS
- 2011
A new objective performance measure for image captioning is introduced, and methods incorporating many state-of-the-art, but fairly noisy, estimates of image content are developed to produce more pleasing results.
Finetuned Language Models Are Zero-Shot Learners
- Computer Science · ICLR
- 2022
It is shown that instruction tuning (finetuning language models on a collection of datasets described via instructions) substantially improves zero-shot performance on unseen tasks and outperforms few-shot GPT-3 by a large margin on ANLI, RTE, BoolQ, AI2-ARC, OpenbookQA, and StoryCloze.
SimVLM: Simple Visual Language Model Pretraining with Weak Supervision
- Computer Science · ICLR
- 2022
These results suggest that zero-shot cross-modality transfer emerges with the scaling of weakly labeled data.