UniTAB: Unifying Text and Box Outputs for Grounded Vision-Language Modeling

@inproceedings{Yang2021UniTABUT,
  title={UniTAB: Unifying Text and Box Outputs for Grounded Vision-Language Modeling},
  author={Zhengyuan Yang and Zhe Gan and Jianfeng Wang and Xiaowei Hu and Faisal Ahmed and Zicheng Liu and Yumao Lu and Lijuan Wang},
  booktitle={European Conference on Computer Vision},
  year={2022}
}
We propose UniTAB that Unifies Text And Box outputs for grounded vision-language (VL) modeling. Grounded VL tasks such as grounded captioning require the model to generate a text description and align predicted words with object regions. To achieve this, models must generate desired text and box outputs together, while also indicating the alignments between words and boxes. In contrast to existing solutions that use multiple separate modules for different outputs, UniTAB represents both text…
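As a concrete illustration of the single-sequence idea, here is a minimal Python sketch of how a grounded caption could be serialized into one token stream by quantizing box coordinates into discrete coordinate tokens placed next to the grounded words. The bin count, the `<bin_i>` naming, and the `<obj>...</obj>` bracketing are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch (not the authors' code): serialize a grounded caption into one
# token sequence by quantizing box coordinates into special <bin_i> tokens.
# The 1000-bin vocabulary and the "<obj>...</obj>" bracketing are assumptions.

NUM_BINS = 1000

def box_to_tokens(box, img_w, img_h, num_bins=NUM_BINS):
    """Quantize an (x1, y1, x2, y2) box in pixels into discrete coordinate tokens."""
    x1, y1, x2, y2 = box
    norm = [x1 / img_w, y1 / img_h, x2 / img_w, y2 / img_h]
    return [f"<bin_{min(int(v * num_bins), num_bins - 1)}>" for v in norm]

def serialize_grounded_caption(words, groundings, img_w, img_h):
    """Interleave caption words with coordinate tokens for grounded spans.

    `groundings` maps a word index to its box, e.g. {1: (30, 40, 200, 220)}.
    """
    tokens = []
    for i, word in enumerate(words):
        tokens.append(word)
        if i in groundings:
            tokens += ["<obj>"] + box_to_tokens(groundings[i], img_w, img_h) + ["</obj>"]
    return tokens

print(serialize_grounded_caption(
    ["a", "dog", "chasing", "a", "ball"],
    {1: (30, 40, 200, 220), 4: (250, 180, 320, 240)},
    img_w=640, img_h=480,
))
```

A decoder trained on sequences like this can emit words and boxes from a single output head, which is the shared-representation property the abstract describes.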

PEVL: Position-enhanced Pre-training and Prompt Tuning for Vision-language Models

PEVL reformulates discretized object positions and language in a unified language modeling framework, which facilitates explicit VL alignment during pre-training, and also enables flexible prompt tuning for various downstream tasks.
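The reformulation can be pictured with a small sketch: positions become extra word-like tokens in the sentence during pre-training, and those same slots can be masked at prompt-tuning time. The `[pos_i]` naming, bin count, and prompt layout below are assumptions for illustration, not PEVL's actual vocabulary.

```python
# Illustrative sketch, not PEVL's released code: express a region as discrete
# position words inside the sentence, then mask them so a masked-language model
# can be prompt-tuned to recover positions. Token names are assumed.

NUM_BINS = 512

def position_words(box, img_w, img_h, num_bins=NUM_BINS):
    x1, y1, x2, y2 = box
    norm = (x1 / img_w, y1 / img_h, x2 / img_w, y2 / img_h)
    return [f"[pos_{min(int(v * num_bins), num_bins - 1)}]" for v in norm]

def build_prompt(caption, box, img_w, img_h, mask_positions=False):
    pos = ["[MASK]"] * 4 if mask_positions else position_words(box, img_w, img_h)
    return f"{caption} {' '.join(pos)}"

# Pre-training style: positions visible.  Prompt-tuning style: positions masked.
print(build_prompt("a cat sleeping on the sofa", (12, 30, 180, 150), 320, 240))
print(build_prompt("a cat sleeping on the sofa", (12, 30, 180, 150), 320, 240,
                   mask_positions=True))
```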

VoLTA: Vision-Language Transformer with Weakly-Supervised Local-Feature Alignment

This work proposes VoLTA (Vision-Language Transformer with weakly-supervised local-feature Alignment), a new VLP paradigm that only utilizes image-caption data but achieves fine-grained region-level image understanding, eliminating the use of expensive box annotations.

Generalized Decoding for Pixel, Image, and Language

X-Decoder is presented, a generalized decoding model that can seamlessly predict pixel-level segmentation and language tokens, decoding different pixel-level and token-level outputs in the same semantic space.

ReCo: Region-Controlled Text-to-Image Generation

The proposed model, dubbed ReCo (Region-Controlled T2I), enables region control for arbitrary objects described by open-ended regional texts rather than by object labels from a constrained category set, and can better control the object count, spatial relationship, and region attributes such as color/size with the free-form regional description.
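One way to picture the input format is sketched below: a global caption followed, for each region, by quantized coordinate tokens and a free-form regional description. The token syntax and bin count are assumptions made for illustration, not ReCo's exact tokenization.

```python
# Illustrative sketch only: compose a region-controlled text-to-image prompt in
# the spirit of ReCo, where each region is given by quantized coordinate tokens
# followed by free-form regional text. Token names and bin count are assumed.

NUM_BINS = 1000

def region_tokens(box, num_bins=NUM_BINS):
    """`box` is (x1, y1, x2, y2) already normalized to [0, 1]."""
    return " ".join(f"<{min(int(v * num_bins), num_bins - 1)}>" for v in box)

def build_reco_prompt(global_caption, regions):
    """`regions` is a list of (normalized_box, regional_text) pairs."""
    parts = [global_caption]
    for box, text in regions:
        parts.append(f"{region_tokens(box)} {text}")
    return " ".join(parts)

print(build_reco_prompt(
    "a living room at sunset",
    [((0.05, 0.55, 0.45, 0.95), "a small red sofa"),
     ((0.60, 0.20, 0.90, 0.50), "a window with warm light")],
))
```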

Uni-Perceiver v2: A Generalist Model for Large-Scale Vision and Vision-Language Tasks

This paper proposes Uni-Perceiver v2, the first generalist model capable of handling major large-scale vision and vision-language tasks with competitive performance. It also proposes an improved optimizer to ensure stable multi-task learning with an unmixed sampling strategy, which is helpful for tasks requiring large-batch training.

PromptCap: Prompt-Guided Task-Aware Image Captioning

Image captioning aims to describe an image with a natural language sentence, allowing powerful language models to understand images. PromptCap builds on this framework of combining image captioning with language models, using a natural-language prompt (e.g., the question to be answered) to control which visual content the generated caption describes.
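The caption-then-LLM pipeline can be sketched as follows; `generate_caption` and `query_llm` are hypothetical placeholder functions for illustration, not PromptCap's actual API.

```python
# Sketch of the captioning-plus-LLM pipeline described above. The two functions
# below are placeholders standing in for a prompt-guided captioner and a
# text-only language model; their names and return values are hypothetical.

def generate_caption(image_path: str, question: str) -> str:
    """Placeholder for a captioner that describes the image content relevant
    to `question`."""
    return "a man holding a golden trophy at a soccer stadium"

def query_llm(prompt: str) -> str:
    """Placeholder for a text-only language model call."""
    return "a trophy"

def answer_visual_question(image_path: str, question: str) -> str:
    caption = generate_caption(image_path, question)
    prompt = (
        f"Image description: {caption}\n"
        f"Question: {question}\n"
        f"Answer:"
    )
    return query_llm(prompt)

print(answer_visual_question("match.jpg", "What is the man holding?"))
```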

Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks

UNIFIED-IO is the first model capable of performing all 7 tasks on the GRIT benchmark and produces strong results across 16 diverse benchmarks like NYUv2-Depth, ImageNet, VQA2.0, OK-VQA, Swig, VizWizGround, BoolQ, and SciTail, with no task-specific fine-tuning.

I-Tuning: Tuning Frozen Language Models with Image for Lightweight Image Captioning

I-Tuning is a lightweight image captioning framework that contains only a small number of trainable parameters and connects the frozen, pre-trained language decoder GPT2 with the vision encoder CLIP-ViT.
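A rough PyTorch sketch of this frozen-backbone recipe is given below: only a small cross-attention connector between a (stand-in) vision encoder and language decoder is trainable. The module design, dimensions, and the toy encoder/decoder are assumptions, not the paper's exact architecture.

```python
# Minimal PyTorch sketch (not the paper's architecture): freeze the vision
# encoder and language decoder, and train only a small cross-attention
# "connector" that injects visual features into the decoder's hidden states.

import torch
import torch.nn as nn

class CrossAttentionConnector(nn.Module):
    """The only trainable part: lets text hidden states attend to image features."""
    def __init__(self, text_dim=768, vision_dim=512, num_heads=8):
        super().__init__()
        self.proj = nn.Linear(vision_dim, text_dim)
        self.attn = nn.MultiheadAttention(text_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(text_dim)

    def forward(self, text_hidden, image_feats):
        mem = self.proj(image_feats)
        attended, _ = self.attn(text_hidden, mem, mem)
        return self.norm(text_hidden + attended)

# Stand-ins for frozen CLIP-ViT and GPT-2 (frozen = requires_grad False).
vision_encoder = nn.Linear(3 * 224 * 224, 512)    # placeholder for CLIP-ViT
language_decoder = nn.Linear(768, 50257)          # placeholder for GPT-2 LM head
for p in list(vision_encoder.parameters()) + list(language_decoder.parameters()):
    p.requires_grad = False

connector = CrossAttentionConnector()
trainable = sum(p.numel() for p in connector.parameters())
print(f"trainable parameters (connector only): {trainable:,}")

# Forward pass with dummy data.
image = torch.randn(2, 3 * 224 * 224)
text_hidden = torch.randn(2, 16, 768)             # pretend GPT-2 hidden states
image_feats = vision_encoder(image).unsqueeze(1)  # (batch, 1, vision_dim)
logits = language_decoder(connector(text_hidden, image_feats))
print(logits.shape)                               # torch.Size([2, 16, 50257])
```

Because the backbones stay frozen, only the connector's parameters need gradients and optimizer state, which is what keeps such a setup lightweight.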

References


Pix2seq: A Language Modeling Framework for Object Detection

Pix2Seq is presented, a simple and generic framework for object detection that achieves competitive results on the challenging COCO dataset compared to highly specialized and well-optimized detection algorithms.
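A minimal sketch of the underlying idea, under simple assumptions about bin count and token ordering: each object becomes a handful of discrete tokens (quantized coordinates plus a class token), so detection reduces to ordinary sequence generation.

```python
# Sketch of the core Pix2seq idea under simple assumptions: each object becomes
# five discrete tokens -- four quantized corner coordinates plus a class token.
# The bin count and token ordering here are illustrative, not the paper's exact
# configuration.

NUM_BINS = 1000

def quantize(value, max_value, num_bins=NUM_BINS):
    """Map a continuous coordinate in [0, max_value] to an integer bin."""
    return min(int(value / max_value * num_bins), num_bins - 1)

def dequantize(bin_idx, max_value, num_bins=NUM_BINS):
    return (bin_idx + 0.5) / num_bins * max_value

def objects_to_sequence(objects, img_w, img_h):
    """`objects` is a list of dicts with 'box' = (x1, y1, x2, y2) and 'label'."""
    seq = []
    for obj in objects:
        x1, y1, x2, y2 = obj["box"]
        seq += [quantize(y1, img_h), quantize(x1, img_w),
                quantize(y2, img_h), quantize(x2, img_w),
                f"cls_{obj['label']}"]
    return seq

seq = objects_to_sequence(
    [{"box": (48, 32, 320, 300), "label": "dog"},
     {"box": (400, 100, 560, 260), "label": "ball"}],
    img_w=640, img_h=480,
)
print(seq)
print(dequantize(seq[1], 640))  # roughly recovers x1 of the first box
```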

Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning

We present a new dataset of image caption annotations, Conceptual Captions, which contains an order of magnitude more images than the MS-COCO dataset (Lin et al., 2014) and represents a wider variety of both images and image caption styles.

Self-Critical Sequence Training for Image Captioning

This paper considers the problem of optimizing image captioning systems using reinforcement learning, and shows that by carefully optimizing systems using the test metrics of the MSCOCO task, significant gains in performance can be realized.
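The self-critical training signal can be sketched in a few lines: the reward of a greedy-decoded caption serves as the baseline for the sampled caption in a REINFORCE-style loss. The rewards and log-probabilities below are dummy placeholders standing in for CIDEr scores and a real captioning model.

```python
# Sketch of the self-critical policy-gradient loss: the greedy-decoded caption's
# reward is the baseline for the sampled caption, so only samples that beat the
# greedy decode are reinforced.

import torch

def scst_loss(sample_logprobs, sample_reward, greedy_reward):
    """sample_logprobs: (batch, seq_len) log-probs of the *sampled* caption tokens."""
    advantage = sample_reward - greedy_reward    # (batch,)
    seq_logprob = sample_logprobs.sum(dim=1)     # log-prob of the whole caption
    return -(advantage * seq_logprob).mean()     # REINFORCE with a baseline

batch, seq_len, vocab = 4, 12, 100
sample_logprobs = torch.log_softmax(
    torch.randn(batch, seq_len, vocab), dim=-1).max(dim=-1).values
sample_reward = torch.tensor([0.9, 0.4, 1.1, 0.2])   # e.g. CIDEr of sampled captions
greedy_reward = torch.tensor([0.7, 0.7, 0.7, 0.7])   # e.g. CIDEr of greedy captions

print(scst_loss(sample_logprobs, sample_reward, greedy_reward))
```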

VQA: Visual Question Answering

We propose the task of free-form and open-ended Visual Question Answering (VQA). Given an image and a natural language question about the image, the task is to provide an accurate natural language answer.

Microsoft COCO Captions: Data Collection and Evaluation Server

The Microsoft COCO Caption dataset and evaluation server are described and several popular metrics, including BLEU, METEOR, ROUGE and CIDEr are used to score candidate captions.

Microsoft COCO: Common Objects in Context

We present a new dataset with the goal of advancing the state-of-the-art in object recognition by placing the question of object recognition in the context of the broader question of scene understanding.

CIDEr: Consensus-based image description evaluation

A novel paradigm for evaluating image descriptions based on human consensus is proposed, along with a new automated metric (CIDEr) that captures human judgment of consensus better than existing metrics across sentences generated by various sources.
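A simplified sketch of the idea is given below (not the official implementation): captions are represented as TF-IDF-weighted n-gram vectors and scored by cosine similarity, averaged over n-gram orders 1 through 4. The real metric additionally averages over multiple references per image, and CIDEr-D adds a length-based penalty.

```python
# Simplified CIDEr-like scorer: TF-IDF-weighted n-gram vectors compared with
# cosine similarity, averaged over n-gram orders 1..4. A toy corpus stands in
# for the full set of reference captions used to estimate document frequencies.

import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def tfidf_vector(tokens, n, doc_freq, num_docs):
    counts = ngrams(tokens, n)
    total = sum(counts.values()) or 1
    return {g: (c / total) * math.log(num_docs / max(1.0, doc_freq.get(g, 0.0)))
            for g, c in counts.items()}

def cosine(u, v):
    dot = sum(u[g] * v.get(g, 0.0) for g in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def cider_like(candidate, reference, corpus, max_n=4):
    """`corpus` is a list of reference token lists used to estimate document freq."""
    score = 0.0
    for n in range(1, max_n + 1):
        doc_freq = Counter(g for doc in corpus for g in ngrams(doc, n))
        vec_c = tfidf_vector(candidate, n, doc_freq, len(corpus))
        vec_r = tfidf_vector(reference, n, doc_freq, len(corpus))
        score += cosine(vec_c, vec_r)
    return score / max_n

corpus = [["a", "dog", "runs", "on", "grass"], ["two", "cats", "on", "a", "sofa"]]
print(cider_like(["a", "dog", "runs"], ["a", "dog", "runs", "on", "grass"], corpus))
```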

Im2Text: Describing Images Using 1 Million Captioned Photographs

A new objective performance measure for image captioning is introduced, and methods incorporating many state-of-the-art, but fairly noisy, estimates of image content are developed to produce even more pleasing results.

Finetuned Language Models Are Zero-Shot Learners

It is shown that instruction tuning (finetuning language models on a collection of datasets described via instructions) substantially improves zero-shot performance on unseen tasks and outperforms few-shot GPT-3 by a large margin on ANLI, RTE, BoolQ, AI2-ARC, OpenbookQA, and StoryCloze.

SimVLM: Simple Visual Language Model Pretraining with Weak Supervision

SimVLM is pretrained end-to-end with a single prefix language modeling objective on weakly aligned image-text data; its results suggest that zero-shot cross-modality transfer emerges with the scaling of weakly labeled data.
...