Visual Clues: Bridging Vision and Language Foundations for Image Paragraph Captioning

@article{Xie2022VisualCB,
  title={Visual Clues: Bridging Vision and Language Foundations for Image Paragraph Captioning},
  author={Yujia Xie and Luowei Zhou and Xiyang Dai and Lu Yuan and Nguyen Bach and Ce Liu and Michael Zeng},
  journal={ArXiv},
  year={2022},
  volume={abs/2206.01843}
}
People say, “A picture is worth a thousand words”. Then how can we get the rich information out of the image? We argue that by using visual clues to bridge large pretrained vision foundation models and language models, we can do so without any extra cross-modal training. Thanks to the strong zero-shot capability of foundation models, we start by constructing a rich semantic representation of the image (e.g., image tags, object attributes/locations, captions) as a structured textual prompt…
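
The pipeline described in the abstract is concrete enough to sketch: outputs from off-the-shelf vision models (image tags, attributed object boxes, candidate captions) are serialized into a structured textual prompt that a pretrained language model can then expand into a paragraph. The sketch below is illustrative only; the `VisualClues` container and `build_prompt` helper are hypothetical names, not the paper's released code.

```python
# Minimal sketch of serializing "visual clues" into a structured textual prompt,
# assuming the tags, object detections, and captions have already been produced
# by off-the-shelf vision foundation models. Names here are illustrative.
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class VisualClues:
    tags: List[str] = field(default_factory=list)        # image-level tags
    objects: List[Tuple[str, str, Tuple[int, int, int, int]]] = field(
        default_factory=list)                            # (name, attribute, box)
    captions: List[str] = field(default_factory=list)    # candidate captions


def build_prompt(clues: VisualClues) -> str:
    """Serialize the clues into a textual prompt a language model can complete."""
    lines = ["Image tags: " + ", ".join(clues.tags)]
    for name, attr, (x1, y1, x2, y2) in clues.objects:
        lines.append(f"Object: {attr} {name} at ({x1}, {y1}, {x2}, {y2})")
    lines.extend("Caption: " + c for c in clues.captions)
    lines.append("Describe the image in a detailed paragraph:")
    return "\n".join(lines)


clues = VisualClues(
    tags=["beach", "dog", "sunset"],
    objects=[("dog", "brown", (40, 80, 220, 300))],
    captions=["a dog running on the sand"],
)
print(build_prompt(clues))  # prompt text fed to a pretrained language model
```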

Single-Stream Multi-Level Alignment for Vision-Language Pretraining

TLDR
A single-stream model is proposed that aligns images and text at multiple levels and explicitly self-supervises the visual modality with pseudo-label supervision, ensuring similar levels of conceptual abstraction in the representations before fusion.

References


Matching Visual Features to Hierarchical Semantic Topics for Image Paragraph Captioning

TLDR
A plug-and-play hierarchical-topic-guided image paragraph generation framework is proposed, which couples a visual extractor with a deep topic model to guide the learning of a language model and capture the correlations between image and text at multiple levels of abstraction.

A Hierarchical Approach for Generating Descriptive Image Paragraphs

TLDR
A model that decomposes both images and paragraphs into their constituent parts is developed, detecting semantic regions in images and using a hierarchical recurrent neural network to reason about language.

Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision

TLDR
A simple dual-encoder architecture learns to align visual and language representations of the image and text pairs using a contrastive loss, and it is shown that the scale of the corpus can make up for its noise and leads to state-of-the-art representations even with such a simple learning scheme.
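
The dual-encoder objective described above can be summarized in a few lines: embed images and texts separately, then train with a symmetric contrastive (InfoNCE-style) loss over in-batch pairs. The sketch below is a generic formulation of that loss, not ALIGN's actual training code.

```python
# Generic symmetric contrastive loss for a dual-encoder image-text model:
# matching pairs should score higher than all other pairs in the batch.
import torch
import torch.nn.functional as F


def contrastive_loss(image_emb: torch.Tensor,
                     text_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    # Normalize so that dot products are cosine similarities.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Similarity matrix: logits[i, j] = similarity of image i with text j.
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric cross-entropy: match images to texts and texts to images.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2


# Example with random embeddings standing in for encoder outputs.
loss = contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```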

Flamingo: a Visual Language Model for Few-Shot Learning

TLDR
It is demonstrated that a single Flamingo model can achieve a new state of the art for few-shot learning, simply by prompting the model with task-specific examples.

Unifying Vision-and-Language Tasks via Text Generation

TLDR
This work proposes a unified framework that learns different tasks in a single architecture with the same language modeling objective, i.e., multimodal conditional text generation, where the models learn to generate labels in text based on the visual and textual inputs.

OK-VQA: A Visual Question Answering Benchmark Requiring External Knowledge

TLDR
This paper addresses the task of knowledge-based visual question answering and provides a benchmark, called OK-VQA, where the image content is not sufficient to answer the questions, encouraging methods that rely on external knowledge resources.

Florence: A New Foundation Model for Computer Vision

TLDR
By incorporating universal visual-language representations from Web-scale image-text data, the Florence model can be easily adapted for various computer vision tasks, such as classification, retrieval, object detection, VQA, image captioning, video retrieval, and action recognition.

Learning Transferable Visual Models From Natural Language Supervision

TLDR
It is demonstrated that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet.
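
In practice, a contrastively pretrained model of this kind can classify images zero-shot by scoring them against natural-language prompts. The sketch below assumes the Hugging Face transformers implementation of CLIP and the public openai/clip-vit-base-patch32 checkpoint; it is a usage illustration, not code from this paper.

```python
# Zero-shot image classification with a CLIP-style model, assuming the
# Hugging Face `transformers` CLIP implementation and public checkpoint.
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.new("RGB", (224, 224))  # stand-in; load a real photo in practice
labels = ["a photo of a dog", "a photo of a cat", "a photo of a beach"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image: similarity of the image to each text prompt.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(labels, probs[0].tolist())))
```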

Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations

TLDR
The Visual Genome dataset is presented, which contains over 108K images where each image has an average of 35 objects, 26 attributes, and 21 pairwise relationships between objects, and represents the densest and largest dataset of image descriptions, objects, attributes, relationships, and question answer pairs.

Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality

TLDR
A novel task and dataset, called Winoground, is introduced for evaluating the ability of vision and language models to conduct visio-linguistic compositional reasoning, with the aim of serving as a useful evaluation set for advancing the state of the art and driving further progress in the industry.
...