• Corpus ID: 237295184

Telling the What while Pointing to the Where: Multimodal Queries for Image Retrieval

Soravit Changpinyo, Jordi Pont-Tuset, Vittorio Ferrari, Radu Soricut
Most existing image retrieval systems use text queries as a way for the user to express what they are looking for. However, fine-grained image retrieval often requires the ability to also express where in the image the content they are looking for is. The text modality can only cumbersomely express such localization preferences, whereas pointing is a more natural fit. In this paper, we propose an image retrieval setup with a new form of multimodal queries, where the user simultaneously uses… 


Dialog-based Interactive Image Retrieval
A new approach to interactive image search that enables users to provide feedback via natural language, allowing for more natural and effective interaction, and achieves better accuracy than other supervised and reinforcement learning baselines.
Composing Text and Image for Image Retrieval - an Empirical Odyssey
  • Nam S. Vo, Lu Jiang, +4 authors James Hays
  • Computer Science
    2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
  • 2019
This paper proposes a new way to combine image and text through a residual connection; the approach outperforms existing approaches on three datasets: Fashion-200k, MIT-States, and a new synthetic dataset the authors create based on CLEVR.
Efficient and Interactive Spatial-Semantic Image Retrieval
Experimental results show that product quantization (PQ) is compatible with the FCN-based image retrieval system, that the quantization process results in little information loss, and that the method outperforms a conventional text-based search system.
Spatial-Semantic Image Search by Visual Feature Synthesis
A spatial-semantic image search technology that lets users search for images with both semantic and spatial constraints by manipulating concept text-boxes on a 2D query canvas. A convolutional neural network is trained to synthesize visual features that capture the spatial-semantic constraints from the user's canvas query.
Query by Semantic Sketch
A retrieval approach that allows users to query visual media collections by sketching concept maps, thereby merging sketch-based retrieval with the search for semantic labels. The semantic sketch query mode is integrated into the retrieval engine vitrivr, and its effectiveness is demonstrated.
Stacked Cross Attention for Image-Text Matching
Stacked Cross Attention, which discovers the full latent alignments using both image regions and words in a sentence as context and infers the image-text similarity, achieves state-of-the-art results on the MS-COCO and Flickr30K datasets.
Image Search With Text Feedback by Visiolinguistic Attention Learning
This work proposes a composite transformer that can be seamlessly plugged into a CNN to selectively preserve and transform the visual features conditioned on language semantics, thus yielding an expressive representation for effective image search.
VisualSparta: Sparse Transformer Fragment-level Matching for Large-scale Text-to-Image Search
This paper proposes VisualSparta, a novel text-to-image retrieval model that shows substantial improvement over existing models in both accuracy and efficiency, and is capable of outperforming all previous scalable methods on MSCOCO and Flickr30K.
Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models
This paper presents Flickr30K Entities, which augments the 158k captions from Flickr30k with 244k coreference chains, linking mentions of the same entities across different captions for the same image, and associating them with 276k manually annotated bounding boxes.
Generating Semantically Precise Scene Graphs from Textual Descriptions for Improved Image Retrieval
It is shown that scene graphs can be effectively created automatically from a natural language scene description and that using the output of the parsers is almost as effective as using human-constructed scene graphs.