Learning to Compose Soft Prompts for Compositional Zero-Shot Learning

Nihal V. Nayak, Peilin Yu, Stephen H. Bach
We introduce compositional soft prompting (CSP), a parameter-efficient learning technique to improve the zero-shot compositionality of large-scale pretrained vision-language models (VLMs) without the overhead of fine-tuning the entire model. VLMs can represent arbitrary classes as natural language prompts in their flexible text encoders, but they underperform state-of-the-art methods on compositional zero-shot benchmark tasks. To improve VLMs, we propose a novel form of soft prompting. We treat…


Learning to Prompt for Vision-Language Models
This work proposes Context Optimization (CoOp), a simple approach for adapting CLIP-like vision-language models to downstream image recognition. It needs as few as one or two shots to beat hand-crafted prompts by a decent margin and gains significant improvements with more shots.
Learning Graph Embeddings for Compositional Zero-shot Learning
This work proposes Compositional Graph Embedding (CGE), a novel graph formulation that learns image features, compositional classifiers, and latent representations of visual primitives end to end, and that significantly outperforms the state of the art on MIT-States and UT-Zappos in the challenging generalized compositional zero-shot setting.
Fine-Grained Visual Comparisons with Local Learning
A. Yu, K. Grauman. 2014 IEEE Conference on Computer Vision and Pattern Recognition, 2014.
This work proposes a local learning approach for fine-grained visual comparisons that outperforms state-of-the-art methods for relative attribute prediction and shows how to identify analogous pairs using learned metrics.
SPoT: Better Frozen Model Adaptation through Soft Prompt Transfer
It is shown that SPoT significantly boosts the performance of Prompt Tuning across many tasks, and an efficient retrieval approach is proposed that interprets task prompts as task embeddings to identify similar tasks and predict the most transferable source tasks for a novel target task.
Independent Prototype Propagation for Zero-Shot Compositionality
This work proposes ProtoProp, a novel prototype propagation graph method that outperforms state-of-the-art results in the generalized compositional zero-shot setting. Ablations show the importance of each part of the method and its contribution to the final results.
Learning Graph Embeddings for Open World Compositional Zero-Shot Learning
This work proposes a new approach, Compositional Cosine Graph Embedding (Co-CGE), which achieves state-of-the-art performance in standard CZSL while outperforming previous methods in the open world scenario.
The Power of Scale for Parameter-Efficient Prompt Tuning
This work explores “prompt tuning”, a simple yet effective mechanism for learning “soft prompts” to condition frozen language models to perform specific downstream tasks, and shows that conditioning a frozen model with soft prompts confers benefits in robustness to domain transfer, as compared to full model tuning.
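The idea in the summary above, learning soft prompts while the model stays frozen, can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the "frozen model" is a fixed linear scorer over mean-pooled embeddings, and the names `frozen_score`, `tune_prompt`, `FROZEN_W`, and `TOKENS` are hypothetical. It is not the method or code from the paper, only a minimal instance of updating prompt vectors by gradient descent while the model weights and input embeddings are left untouched.

```python
def mean_pool(vectors):
    """Average a list of equal-length vectors component-wise."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def frozen_score(frozen_w, prompt, token_embs):
    """Toy frozen model: dot product of fixed weights with the pooled input.

    The soft prompt vectors are prepended to the token embeddings.
    """
    pooled = mean_pool(prompt + token_embs)
    return sum(w * x for w, x in zip(frozen_w, pooled))


def tune_prompt(frozen_w, token_embs, target, prompt_len=2, lr=0.5, steps=50):
    """Gradient descent on squared error, updating only the prompt vectors."""
    dim = len(frozen_w)
    prompt = [[0.0] * dim for _ in range(prompt_len)]
    total = prompt_len + len(token_embs)
    for _ in range(steps):
        err = frozen_score(frozen_w, prompt, token_embs) - target
        # d(loss)/d(prompt vector) = 2 * err * w / total for this toy model.
        grad = [2.0 * err * w / total for w in frozen_w]
        prompt = [[p - lr * g for p, g in zip(vec, grad)] for vec in prompt]
    return prompt


# Hypothetical toy values; the weights and tokens are never modified.
FROZEN_W = [1.0, -1.0]
TOKENS = [[0.5, 0.2], [0.1, 0.4]]

learned = tune_prompt(FROZEN_W, TOKENS, target=1.0)
loss_before = (frozen_score(FROZEN_W, [[0.0, 0.0]] * 2, TOKENS) - 1.0) ** 2
loss_after = (frozen_score(FROZEN_W, learned, TOKENS) - 1.0) ** 2
```

In this sketch only the two prompt vectors change during training, which is what makes the approach parameter-efficient: the number of trained values is the prompt length times the embedding dimension, independent of the size of the frozen model.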
Learning How to Ask: Querying LMs with Mixtures of Soft Prompts
This work explores the idea of learning prompts by gradient descent—either fine-tuning prompts taken from previous work, or starting from random initialization, showing that the implicit factual knowledge in language models was previously underestimated.
Open World Compositional Zero-Shot Learning
While the simple CZSL model achieves state-of-the-art performance in the closed world scenario, the feasibility scores boost the performance of the approach in the open world setting, clearly outperforming the previous state of the art.
Learning Transferable Visual Models From Natural Language Supervision
It is demonstrated that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet.
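The pre-training task described above, predicting which caption goes with which image, is commonly formulated as a symmetric contrastive objective over a batch of (image, text) pairs. The following is a minimal pure-Python sketch under that assumption. The function names and the fixed temperature value of 0.07 are illustrative choices, and the embeddings are plain lists rather than encoder outputs; it is not the paper's implementation.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cross_entropy(logits, target):
    """Negative log-softmax of the target entry (numerically stable)."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target]


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric cross-entropy over the image-text similarity matrix.

    Row k of the batch is assumed to be a matching (image, text) pair,
    so the correct class for row k and for column k is index k.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    logits = [
        [sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
        for i in imgs
    ]
    # Image-to-text direction: each row picks its caption.
    loss_i = sum(cross_entropy(logits[k], k) for k in range(n)) / n
    # Text-to-image direction: each column picks its image.
    loss_t = sum(
        cross_entropy([logits[r][k] for r in range(n)], k) for k in range(n)
    ) / n
    return (loss_i + loss_t) / 2


images = [[1.0, 0.0], [0.0, 1.0]]
matched = contrastive_loss(images, [[1.0, 0.0], [0.0, 1.0]])
mismatched = contrastive_loss(images, [[0.0, 1.0], [1.0, 0.0]])
```

With these toy inputs, the loss is near zero when each caption embedding aligns with its own image and large when the captions are swapped, which is the behavior the objective is meant to reward.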