Visual In-Context Learning for Large Vision-Language Models

@inproceedings{Zhou2024VisualIL,
  title={Visual In-Context Learning for Large Vision-Language Models},
  author={Yucheng Zhou and Xiang Li and Qianning Wang and Jianbing Shen},
  booktitle={Annual Meeting of the Association for Computational Linguistics},
  year={2024},
  url={https://api.semanticscholar.org/CorpusID:267750174}
}
This work introduces a novel Visual In-Context Learning method comprising Visual Demonstration Retrieval, Intent-Oriented Image Summarization, and Intent-Oriented Demonstration Composition: it retrieves images via a "Retrieval & Rerank" paradigm, summarizes images with task intent and task-specific visual parsing, and composes language-based demonstrations that reduce the token count and alleviate the cross-modal interaction problem.
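
A minimal sketch of the "Retrieval & Rerank" step described above, assuming a CLIP-style encoder from the sentence-transformers library; the rerank score that blends image and caption similarity is an illustrative assumption, not the paper's exact criterion.

```python
# Sketch: "Retrieval & Rerank" for visual demonstration selection.
# Assumes a CLIP-style encoder from sentence-transformers; the rerank score
# (image similarity blended with caption similarity) is illustrative only.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("clip-ViT-B-32")  # off-the-shelf image/text encoder

def retrieve_and_rerank(query_image, candidates, top_k=20, final_k=4):
    """candidates: list of dicts with 'image' (PIL.Image) and 'caption' (str)."""
    # Stage 1: coarse retrieval by image-to-image similarity.
    query_emb = model.encode(query_image, convert_to_tensor=True)
    cand_embs = model.encode([c["image"] for c in candidates], convert_to_tensor=True)
    scores = util.cos_sim(query_emb, cand_embs)[0]
    top_idx = scores.topk(min(top_k, len(candidates))).indices.tolist()

    # Stage 2: rerank the shortlist with an additional text-side signal.
    cap_embs = model.encode([candidates[i]["caption"] for i in top_idx],
                            convert_to_tensor=True)
    rerank = 0.5 * scores[top_idx] + 0.5 * util.cos_sim(query_emb, cap_embs)[0]
    order = rerank.argsort(descending=True)[:final_k].tolist()
    return [candidates[top_idx[i]] for i in order]
```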

Advancing Multimodal In-Context Learning in Large Vision-Language Models with Task-aware Demonstrations

This work proposes SabER, a lightweight yet powerful decoder-only transformer equipped with task-aware attention, which selects and arranges in-context demonstrations (ICDs) from a demonstration library in an autoregressive fashion, enabling fine-grained feature extraction and cross-modal reasoning while iteratively refining the task mapping to generate high-quality ICD sequences.

VCM: Vision Concept Modeling Based on Implicit Contrastive Learning with Vision-Language Instruction Fine-Tuning

This work proposes VCM, an end-to-end self-supervised visual concept modeling framework that leverages implicit contrastive learning across multiple sampled instances and vision-language fine-tuning to construct a visual concept model without requiring costly concept-level annotations.

Optimizing Vision-Language Interactions Through Decoder-Only Models

This work proposes MUDAIF (Multimodal Unified Decoder with Adaptive Input Fusion), a decoder-only vision-language model that seamlessly integrates visual and textual inputs through a novel Vision-Token Adapter and an adaptive co-attention mechanism, establishing a new standard for encoder-free vision-language models.

Vision-Driven Prompt Optimization for Large Language Models in Multimodal Generative Tasks

This work proposes a novel framework that leverages Large Language Models to dynamically generate textual prompts from visual inputs, guiding high-fidelity image synthesis and offering a versatile solution for in-domain and out-of-domain tasks.

Large Visual-Language Models Are Also Good Classifiers: A Study of In-Context Multimodal Fake News Detection

The experimental results suggest that the IMFND framework significantly boosts the fake news detection (FND) performance of LVLMs, achieving higher accuracy than the standard ICL approach across three publicly available FND datasets.

Leveraging Retrieval-Augmented Tags for Large Vision-Language Understanding in Complex Scenes

This work proposes the Vision-Aware Retrieval-Augmented Prompting (VRAP) framework, a generative approach that enhances Large Vision-Language Models by integrating retrieval-augmented object tags into their prompts, and demonstrates that VRAP is a robust and efficient framework for advancing object-aware multimodal reasoning.
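
A minimal sketch of the retrieval-augmented tagging idea summarized above, assuming a pre-built index of (embedding, tag) pairs and a simple prompt template; both the index format and the template are illustrative assumptions rather than VRAP's exact design.

```python
# Sketch: injecting retrieved object tags into an LVLM prompt.
# The tag index, embeddings, and prompt template are illustrative assumptions.
import numpy as np

def retrieve_tags(image_emb, tag_index, top_k=5):
    """tag_index: list of (embedding, tag) pairs built offline from a tagged corpus."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
    sims = [cos(image_emb, emb) for emb, _ in tag_index]
    order = np.argsort(sims)[::-1][:top_k]
    return [tag_index[i][1] for i in order]

def build_prompt(question, tags):
    # Prepend the retrieved object tags so the LVLM can ground its answer on them.
    return f"Objects in the image: {', '.join(tags)}.\nQuestion: {question}\nAnswer:"
```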

VisualCloze: A Universal Image Generation Framework via Visual In-Context Learning

This work proposes VisualCloze, a universal image generation framework that supports a wide range of in-domain tasks, generalization to unseen ones, unseen unification of multiple tasks, and reverse generation, and introduces Graph200K, a graph-structured dataset that establishes various interrelated tasks, enhancing task density and transferable knowledge.

Large Vision-Language Models for Remote Sensing Visual Question Answering

This paper proposes a novel method that leverages a generative Large Vision-Language Model (LVLM) to streamline the RSVQA process and enables the LVLM to generate natural language answers by conditioning on both visual and textual inputs, without the need for predefined answer categories.

Bridging Vision and Language: Modeling Causality and Temporality in Video Narratives

This work presents an enhanced framework that integrates a Causal-Temporal Reasoning Module (CTRM) into state-of-the-art LVLMs, together with a multi-stage learning strategy that combines pre-training on large-scale video-text datasets, fine-tuning on causally annotated data, and contrastive alignment for better embedding coherence.

Incomplete In-context Learning

IJIP demonstrates strong performance across two LVLMs and two datasets under three distinct conditions of label incompleteness, achieving a peak accuracy of 93.9%; it can also be applied directly to prompt learning and is adaptable to the text domain.
...

Unifying Vision-and-Language Tasks via Text Generation

This work proposes a unified framework that learns different tasks in a single architecture with the same language modeling objective, i.e., multimodal conditional text generation, where the models learn to generate labels in text based on the visual and textual inputs.
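
A small illustration of casting heterogeneous vision-and-language tasks as conditional text generation, as the summary above describes; the task prefixes and target formats below are illustrative assumptions, not the paper's exact templates.

```python
# Sketch: framing different V&L tasks as (multimodal input, text target) pairs
# trained under one text-generation objective (task prefixes are illustrative).
def to_text_generation_example(task, image, inputs, label):
    if task == "vqa":
        source = f"vqa question: {inputs['question']}"
        target = label                       # e.g. "2" or "a red bus"
    elif task == "captioning":
        source = "caption this image:"
        target = label                       # the reference caption
    elif task == "grounding":
        source = f"ground the phrase: {inputs['phrase']}"
        target = label                       # e.g. a region token such as "<vis_17>"
    else:
        raise ValueError(f"unknown task: {task}")
    return {"image": image, "source_text": source, "target_text": target}
```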

Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities

This work introduces the Qwen-VL series, a set of large-scale vision-language models designed to perceive and understand both text and images, which outperforms existing Large Vision-Language Models (LVLMs).

Exploring Effective Factors for Improving Visual In-Context Learning

This paper proposes prompt-SelF, a simple framework that, for the first time, outperforms the meta-learning-based OSLSM method on 1-shot segmentation, indicating the great potential of visual in-context learning.

Learning Transferable Visual Models From Natural Language Supervision

It is demonstrated that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet.
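
A minimal sketch of that contrastive pre-training objective, assuming PyTorch and pre-computed batch embeddings from the two encoders; the symmetric cross-entropy over the pairwise similarity matrix follows the CLIP formulation, with encoder details omitted.

```python
# Sketch: CLIP-style symmetric contrastive loss over a batch of paired
# image/text embeddings (encoders omitted; PyTorch assumed).
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """image_emb, text_emb: (batch, dim) tensors from the image and text encoders."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature   # pairwise cosine similarities
    targets = torch.arange(len(image_emb))            # i-th image pairs with i-th text
    loss_i2t = F.cross_entropy(logits, targets)       # predict the caption for each image
    loss_t2i = F.cross_entropy(logits.t(), targets)   # predict the image for each caption
    return (loss_i2t + loss_t2i) / 2

# Usage: loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```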

What Makes Good Examples for Visual In-Context Learning?

This paper presents an unsupervised prompt retrieval method based on nearest-example search with an off-the-shelf model, and a supervised prompt retrieval method that trains a neural network to choose examples that directly maximize in-context learning performance.

Rethinking the Role of Demonstrations: What Makes In-Context Learning Work?

This paper shows that ground truth demonstrations are in fact not required and that other aspects of the demonstrations are the key drivers of end task performance, including the fact that they provide a few examples of the label space, the distribution of the input text, and the overall format of the sequence.

Label Words are Anchors: An Information Flow Perspective for Understanding In-Context Learning

This work introduces an anchor re-weighting method to improve ICL performance, a demonstration compression technique to expedite inference, and an analysis framework for diagnosing ICL errors in GPT2-XL.

In-Context Learning with Iterative Demonstration Selection

Iterative Demonstration Selection (IDS) iteratively selects examples that are diverse but still strongly correlated with the test sample as ICL demonstrations, and can consistently outperform existing ICL demonstration selection methods.
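
A minimal sketch of selecting demonstrations that stay close to the test sample while remaining mutually diverse; this MMR-style greedy heuristic is an assumption for illustration and may differ from IDS's actual iterative procedure.

```python
# Sketch: greedy similarity-vs-diversity demonstration selection
# (an MMR-style heuristic; the cited IDS procedure may differ).
import numpy as np

def select_demonstrations(test_emb, cand_embs, k=4, lam=0.7):
    """test_emb: (d,) array; cand_embs: (n, d) array of candidate embeddings."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

    relevance = [cos(test_emb, c) for c in cand_embs]
    selected = []
    while len(selected) < min(k, len(cand_embs)):
        best, best_score = None, -np.inf
        for i in range(len(cand_embs)):
            if i in selected:
                continue
            # Trade similarity to the test sample off against redundancy
            # with demonstrations that are already selected.
            redundancy = max((cos(cand_embs[i], cand_embs[j]) for j in selected),
                             default=0.0)
            score = lam * relevance[i] - (1 - lam) * redundancy
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
    return selected  # indices of chosen demonstrations, in selection order
```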

ICD-LM: Configuring Vision-Language In-Context Demonstrations by Language Modeling

This paper studies how to configure powerful In-Context Demonstration (ICD) sequences for a Large Vision-Language Model (LVLM) to solve Vision-Language tasks through In-Context Learning (ICL) and introduces an ICD Language Model specifically designed to generate effective ICD sequences.

Improving Language Understanding by Generative Pre-Training

The general task-agnostic model outperforms discriminatively trained models that use architectures specifically crafted for each task, improving upon the state of the art in 9 out of the 12 tasks studied.
...