Visual In-Context Learning for Large Vision-Language Models

@inproceedings{Zhou2024VisualIL,
  title={Visual In-Context Learning for Large Vision-Language Models},
  author={Yucheng Zhou and Xiang Li and Qianning Wang and Jianbing Shen},
  booktitle={Annual Meeting of the Association for Computational Linguistics},
  year={2024},
  url={https://api.semanticscholar.org/CorpusID:267750174}
}
This work introduces a novel Visual In-Context Learning method comprising Visual Demonstration Retrieval, Intent-Oriented Image Summarization, and Intent-Oriented Demonstration Composition: it retrieves images via a "Retrieval & Rerank" paradigm, summarizes images with task intent and task-specific visual parsing, and composes language-based demonstrations that reduce the token count and alleviate the cross-modal interaction problem.
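
A minimal sketch of the "Retrieval & Rerank" step described above, assuming a CLIP-style encoder from the sentence-transformers library; the rerank score that blends image and caption similarity is an illustrative assumption, not the paper's exact criterion.

```python
# Sketch: "Retrieval & Rerank" for visual demonstration selection.
# Assumes a CLIP-style encoder from sentence-transformers; the rerank score
# (image similarity blended with caption similarity) is illustrative only.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("clip-ViT-B-32")  # off-the-shelf image/text encoder

def retrieve_and_rerank(query_image, candidates, top_k=20, final_k=4):
    """candidates: list of dicts with 'image' (PIL.Image) and 'caption' (str)."""
    # Stage 1: coarse retrieval by image-to-image similarity.
    query_emb = model.encode(query_image, convert_to_tensor=True)
    cand_embs = model.encode([c["image"] for c in candidates], convert_to_tensor=True)
    scores = util.cos_sim(query_emb, cand_embs)[0]
    top_idx = scores.topk(min(top_k, len(candidates))).indices.tolist()

    # Stage 2: rerank the shortlist with an additional text-side signal.
    cap_embs = model.encode([candidates[i]["caption"] for i in top_idx],
                            convert_to_tensor=True)
    rerank = 0.5 * scores[top_idx] + 0.5 * util.cos_sim(query_emb, cap_embs)[0]
    order = rerank.argsort(descending=True)[:final_k].tolist()
    return [candidates[top_idx[i]] for i in order]
```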

Advancing Multimodal In-Context Learning in Large Vision-Language Models with Task-aware Demonstrations

This work proposes SabER, a lightweight yet powerful decoder-only transformer equipped with task-aware attention, which selects and arranges in-context demonstrations (ICDs) from a demonstration library in an autoregressive fashion, enabling fine-grained feature extraction and cross-modal reasoning while iteratively refining the task mapping to generate high-quality ICD sequences.

VCM: Vision Concept Modeling Based on Implicit Contrastive Learning with Vision-Language Instruction Fine-Tuning

This work proposes VCM, an end-to-end self-supervised visual concept modeling framework that leverages implicit contrastive learning across multiple sampled instances and vision-language fine-tuning to construct a visual concept model without requiring costly concept-level annotations.

Optimizing Vision-Language Interactions Through Decoder-Only Models

This work proposes MUDAIF (Multimodal Unified Decoder with Adaptive Input Fusion), a decoder-only vision-language model that seamlessly integrates visual and textual inputs through a novel Vision-Token Adapter and an adaptive co-attention mechanism, establishing a new standard for encoder-free vision-language models.

Vision-Driven Prompt Optimization for Large Language Models in Multimodal Generative Tasks

This work proposes a novel framework that leverages Large Language Models to dynamically generate textual prompts from visual inputs, guiding high-fidelity image synthesis and offering a versatile solution for in-domain and out-of-domain tasks.

Large Visual-Language Models Are Also Good Classifiers: A Study of In-Context Multimodal Fake News Detection

The experimental results suggest that the IMFND framework significantly boosts the fake news detection (FND) performance of LVLMs, achieving higher accuracy than the standard ICL approach across three publicly available FND datasets.

Leveraging Retrieval-Augmented Tags for Large Vision-Language Understanding in Complex Scenes

This work proposes the Vision-Aware Retrieval-Augmented Prompting (VRAP) framework, a generative approach that enhances Large Vision-Language Models by integrating retrieval-augmented object tags into their prompts, and demonstrates that VRAP is a robust and efficient framework for advancing object-aware multimodal reasoning.
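
A minimal sketch of the retrieval-augmented tagging idea summarized above, assuming a pre-built index of (embedding, tag) pairs and a simple prompt template; both the index format and the template are illustrative assumptions rather than VRAP's exact design.

```python
# Sketch: injecting retrieved object tags into an LVLM prompt.
# The tag index, embeddings, and prompt template are illustrative assumptions.
import numpy as np

def retrieve_tags(image_emb, tag_index, top_k=5):
    """tag_index: list of (embedding, tag) pairs built offline from a tagged corpus."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
    sims = [cos(image_emb, emb) for emb, _ in tag_index]
    order = np.argsort(sims)[::-1][:top_k]
    return [tag_index[i][1] for i in order]

def build_prompt(question, tags):
    # Prepend the retrieved object tags so the LVLM can ground its answer on them.
    return f"Objects in the image: {', '.join(tags)}.\nQuestion: {question}\nAnswer:"
```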

VisualCloze: A Universal Image Generation Framework via Visual In-Context Learning

This work proposes VisualCloze, a universal image generation framework that supports a wide range of in-domain tasks, generalization to unseen ones, unseen unification of multiple tasks, and reverse generation, and introduces Graph200K, a graph-structured dataset that establishes various interrelated tasks, enhancing task density and transferable knowledge.

Large Vision-Language Models for Remote Sensing Visual Question Answering

This paper proposes a novel method that leverages a generative Large Vision-Language Model (LVLM) to streamline the RSVQA process and enables the LVLM to generate natural language answers by conditioning on both visual and textual inputs, without the need for predefined answer categories.

Bridging Vision and Language: Modeling Causality and Temporality in Video Narratives

This work presents an enhanced framework that integrates a Causal-Temporal Reasoning Module (CTRM) into state-of-the-art LVLMs, together with a multi-stage learning strategy that combines pre-training on large-scale video-text datasets, fine-tuning on causally annotated data, and contrastive alignment for better embedding coherence.

Incomplete In-context Learning

IJIP demonstrates strong performance across two LVLMs and two datasets under three distinct conditions of label incompleteness, achieving a peak accuracy of 93.9%; it can also be applied directly to prompt learning and is adaptable to the text domain.
...

Unifying Vision-and-Language Tasks via Text Generation

This work proposes a unified framework that learns different tasks in a single architecture with the same language modeling objective, i.e., multimodal conditional text generation, where the models learn to generate labels in text based on the visual and textual inputs.
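
A small illustration of casting heterogeneous vision-and-language tasks as conditional text generation, as the summary above describes; the task prefixes and target formats below are illustrative assumptions, not the paper's exact templates.

```python
# Sketch: framing different V&L tasks as (multimodal input, text target) pairs
# trained under one text-generation objective (task prefixes are illustrative).
def to_text_generation_example(task, image, inputs, label):
    if task == "vqa":
        source = f"vqa question: {inputs['question']}"
        target = label                       # e.g. "2" or "a red bus"
    elif task == "captioning":
        source = "caption this image:"
        target = label                       # the reference caption
    elif task == "grounding":
        source = f"ground the phrase: {inputs['phrase']}"
        target = label                       # e.g. a region token such as "<vis_17>"
    else:
        raise ValueError(f"unknown task: {task}")
    return {"image": image, "source_text": source, "target_text": target}
```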

Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities

This work introduces the Qwen-VL series, a set of large-scale vision-language models designed to perceive and understand both text and images, which outperforms existing Large Vision-Language Models (LVLMs).

Exploring Effective Factors for Improving Visual In-Context Learning

This paper proposes prompt-SelF, a simple framework that, for the first time, outperforms the meta-learning-based OSLSM method on 1-shot segmentation, indicating the great potential of visual in-context learning.

Learning Transferable Visual Models From Natural Language Supervision

It is demonstrated that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet.
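
A minimal sketch of that contrastive pre-training objective, assuming PyTorch and pre-computed batch embeddings from the two encoders; the symmetric cross-entropy over the pairwise similarity matrix follows the CLIP formulation, with encoder details omitted.

```python
# Sketch: CLIP-style symmetric contrastive loss over a batch of paired
# image/text embeddings (encoders omitted; PyTorch assumed).
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """image_emb, text_emb: (batch, dim) tensors from the image and text encoders."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature   # pairwise cosine similarities
    targets = torch.arange(len(image_emb))            # i-th image pairs with i-th text
    loss_i2t = F.cross_entropy(logits, targets)       # predict the caption for each image
    loss_t2i = F.cross_entropy(logits.t(), targets)   # predict the image for each caption
    return (loss_i2t + loss_t2i) / 2

# Usage: loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```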

What Makes Good Examples for Visual In-Context Learning?

This paper presents an unsupervised prompt retrieval method based on nearest-example search with an off-the-shelf model, and a supervised prompt retrieval method that trains a neural network to choose examples that directly maximize in-context learning performance.

Rethinking the Role of Demonstrations: What Makes In-Context Learning Work?

This paper shows that ground truth demonstrations are in fact not required and that other aspects of the demonstrations are the key drivers of end task performance, including the fact that they provide a few examples of the label space, the distribution of the input text, and the overall format of the sequence.

Label Words are Anchors: An Information Flow Perspective for Understanding In-Context Learning

This work introduces an anchor re-weighting method to improve ICL performance, a demonstration compression technique to expedite inference, and an analysis framework for diagnosing ICL errors in GPT2-XL.

In-Context Learning with Iterative Demonstration Selection

Iterative Demonstration Selection (IDS) iteratively selects examples that are diverse but still strongly correlated with the test sample as ICL demonstrations, and can consistently outperform existing ICL demonstration selection methods.
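
A minimal sketch of selecting demonstrations that stay close to the test sample while remaining mutually diverse; this MMR-style greedy heuristic is an assumption for illustration and may differ from IDS's actual iterative procedure.

```python
# Sketch: greedy similarity-vs-diversity demonstration selection
# (an MMR-style heuristic; the cited IDS procedure may differ).
import numpy as np

def select_demonstrations(test_emb, cand_embs, k=4, lam=0.7):
    """test_emb: (d,) array; cand_embs: (n, d) array of candidate embeddings."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

    relevance = [cos(test_emb, c) for c in cand_embs]
    selected = []
    while len(selected) < min(k, len(cand_embs)):
        best, best_score = None, -np.inf
        for i in range(len(cand_embs)):
            if i in selected:
                continue
            # Trade similarity to the test sample off against redundancy
            # with demonstrations that are already selected.
            redundancy = max((cos(cand_embs[i], cand_embs[j]) for j in selected),
                             default=0.0)
            score = lam * relevance[i] - (1 - lam) * redundancy
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
    return selected  # indices of chosen demonstrations, in selection order
```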

ICD-LM: Configuring Vision-Language In-Context Demonstrations by Language Modeling

This paper studies how to configure powerful In-Context Demonstration (ICD) sequences for a Large Vision-Language Model (LVLM) to solve Vision-Language tasks through In-Context Learning (ICL) and introduces an ICD Language Model specifically designed to generate effective ICD sequences.

Improving Language Understanding by Generative Pre-Training

The general task-agnostic model outperforms discriminatively trained models that use architectures specifically crafted for each task, improving upon the state of the art in 9 out of the 12 tasks studied.
...