Probabilistic Compositional Embeddings for Multimodal Image Retrieval

Andrei Neculai, Yanbei Chen, Zeynep Akata. "Probabilistic Compositional Embeddings for Multimodal Image Retrieval." 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).
Existing work in image retrieval typically considers retrieval with one or two query inputs, a setting that does not generalize to multiple queries. In this work, we investigate a more challenging scenario: composing multiple multimodal queries for image retrieval. Given an arbitrary number of query images and/or texts, our goal is to retrieve target images containing the semantic concepts specified across all the multimodal queries. To learn an informative embedding that can flexibly encode the…


Composing Text and Image for Image Retrieval - an Empirical Odyssey

  • Nam S. Vo, Lu Jiang, James Hays
  • Computer Science
    2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
  • 2019
This paper proposes a new way to combine image and text through residual connection, that outperforms existing approaches on 3 different datasets, namely Fashion-200k, MIT-States and a new synthetic dataset the authors create based on CLEVR.
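The residual composition described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the dimensions, the random weights, and the single linear projection are all illustrative stand-ins for learned components.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # embedding dimension (illustrative)

# Hypothetical pre-extracted image and text features.
img_feat = rng.standard_normal(d)
txt_feat = rng.standard_normal(d)

# Hypothetical learned projection producing a modification vector.
W = rng.standard_normal((d, 2 * d)) * 0.1

def compose(img, txt):
    """Residual composition: the image feature plus a learned
    modification computed from the concatenated image/text features,
    so the text acts as a delta on the image embedding."""
    modification = W @ np.concatenate([img, txt])
    return img + modification

composed = compose(img_feat, txt_feat)
```

The residual connection means the composed embedding stays close to the image embedding when the modification is small, which matches the retrieval setting where the text specifies an edit to the query image.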

Multimodal Residual Learning for Visual QA

This work presents Multimodal Residual Networks (MRN) for multimodal residual learning in visual question answering, extending the idea of deep residual learning.

Probabilistic Embeddings for Cross-Modal Retrieval

It is argued that deterministic functions are not sufficiently powerful to capture one-to-many correspondences, and Probabilistic Cross-Modal Embedding (PCME) is proposed, in which samples from the different modalities are represented as probability distributions in the common embedding space.
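The idea of representing a sample as a distribution rather than a point can be sketched as follows. This is a simplified, hypothetical version (not PCME itself): an encoder is assumed to output a mean and a log-variance per sample, from which embeddings are drawn via the reparameterization trick.

```python
import numpy as np

rng = np.random.default_rng(42)
d = 4  # embedding dimension (illustrative)

# Hypothetical encoder outputs: each image or caption maps to a
# Gaussian (mean, log-variance) rather than a single point.
mu = rng.standard_normal(d)
log_var = rng.standard_normal(d) * 0.1

def sample_embedding(mu, log_var, n_samples=7):
    """Draw embedding samples via the reparameterization trick:
    z = mu + eps * std, with eps ~ N(0, I)."""
    std = np.exp(0.5 * log_var)
    eps = rng.standard_normal((n_samples, mu.size))
    return mu + eps * std

samples = sample_embedding(mu, log_var)
```

Because each sample becomes a cloud of embeddings, one image can match several captions (and vice versa), which is the one-to-many behavior deterministic point embeddings cannot express.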

Deep Residual Learning for Image Recognition

This work presents a residual learning framework to ease the training of networks that are substantially deeper than those used previously, and provides comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth.
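The core of residual learning is that a block predicts a residual F(x) and adds it to an identity shortcut, y = F(x) + x. A minimal NumPy sketch (random weights stand in for learned ones, and convolutions are replaced by dense layers for brevity):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 6  # feature dimension (illustrative)
W1 = rng.standard_normal((d, d)) * 0.1
W2 = rng.standard_normal((d, d)) * 0.1

def residual_block(x):
    """y = F(x) + x: the block only has to learn the residual F,
    which eases optimization of very deep networks."""
    h = np.maximum(0.0, W1 @ x)  # ReLU nonlinearity
    return W2 @ h + x            # identity shortcut

x = rng.standard_normal(d)
y = residual_block(x)
```

If the optimal mapping is close to the identity, the block only needs to drive F toward zero, which is easier than learning the identity from scratch.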

Products and Convolutions of Gaussian Probability Density Functions

It is well known that the product and the convolution of two Gaussian probability density functions (PDFs) are also Gaussian. This memo provides derivations for the mean and standard deviation of the resulting Gaussians.
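The closed form for the product case can be checked numerically: the precisions (inverse variances) add, and the mean is the precision-weighted average of the two means. A small self-contained sketch:

```python
import numpy as np

def gaussian_product(mu1, var1, mu2, var2):
    """Mean and variance of the (renormalized) product of two 1-D
    Gaussian PDFs: precisions add, and the mean is the
    precision-weighted average of the two means."""
    var = 1.0 / (1.0 / var1 + 1.0 / var2)
    mu = var * (mu1 / var1 + mu2 / var2)
    return mu, var

mu, var = gaussian_product(0.0, 1.0, 2.0, 1.0)
# Equal variances: the product's mean is the midpoint and the
# variance halves.
assert np.isclose(mu, 1.0) and np.isclose(var, 0.5)
```

This product rule is what makes Gaussian embeddings composable: intersecting the distributions of several queries yields another Gaussian in closed form.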

Composed Query Image Retrieval Using Locally Bounded Features

  • M. Hosseinzadeh, Yang Wang
  • Computer Science
    2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
  • 2020
This paper proposes a novel method that represents the image using a set of local areas in the image, allowing the model to accurately correlate the modification text to parts of the image.

Task-Driven Modular Networks for Zero-Shot Compositional Learning

This study focuses on the problem of compositional zero-shot classification of object-attribute categories and shows that current evaluation metrics are flawed, as they only consider unseen object-attribute pairs.

Benchmark for Compositional Text-to-Image Synthesis

This work presents the first systematic study of text-to-image generation on zero-shot compositional splits targeting two scenarios, unseen object-color and object-shape phrases, and proposes a new metric based on a powerful vision-and-language CLIP model, which is leveraged to compute R-Precision.

Attention Bottlenecks for Multimodal Fusion

This work introduces a novel transformer-based architecture that uses "fusion bottlenecks" for modality fusion at multiple layers, and shows that this strategy improves fusion performance while reducing computational cost.
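The bottleneck idea can be sketched in a simplified form: each modality attends only over its own tokens plus a small set of shared bottleneck tokens, so cross-modal information must pass through that narrow channel. This is a hypothetical single-layer NumPy sketch, not the paper's architecture (no learned projections, no multi-head attention, and token counts are illustrative).

```python
import numpy as np

rng = np.random.default_rng(3)
d = 8            # token dimension (illustrative)
n_a, n_v = 5, 6  # audio / visual token counts (illustrative)
n_b = 2          # small bottleneck through which modalities exchange info

audio = rng.standard_normal((n_a, d))
video = rng.standard_normal((n_v, d))
bottleneck = rng.standard_normal((n_b, d))

def attend(queries, keys_values):
    """Plain scaled dot-product attention (no learned projections,
    for brevity); values equal keys."""
    scores = queries @ keys_values.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ keys_values

# One fusion layer: each modality sees itself plus the bottleneck;
# only the bottleneck tokens attend over both modalities.
audio_out = attend(audio, np.vstack([audio, bottleneck]))
video_out = attend(video, np.vstack([video, bottleneck]))
bottleneck_out = attend(bottleneck, np.vstack([audio, video, bottleneck]))
```

Restricting cross-modal attention to a handful of bottleneck tokens is what cuts the quadratic attention cost while forcing each modality to distill what it shares.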

CoSMo: Content-Style Modulation for Image Retrieval with Text Feedback

An in-depth view of the CoSMo algorithm and its design choices is provided, and it is shown that it accomplishes outstanding performance on multiple image-text retrieval benchmarks.