Corpus ID: 67855985

Let's Transfer Transformations of Shared Semantic Representations

@article{Vo2019LetsTT,
  title={Let's Transfer Transformations of Shared Semantic Representations},
  author={Nam S. Vo and Lu Jiang and James Hays},
  journal={ArXiv},
  year={2019},
  volume={abs/1903.00793}
}
Given a good image understanding capability, can we manipulate an image's high-level semantic representation? Such a transformation operation can be used to generate or retrieve similar images with a desired modification (for example, changing a beach background to a street background); similar abilities have been demonstrated in zero-shot learning, attribute composition, and attribute-manipulation image search. In this work we show how one can learn transformations with no training examples by…
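To make the abstract's idea concrete, here is a minimal sketch of applying a text-derived transformation to an image embedding in a shared semantic space. It assumes pre-computed embeddings; the simple vector-offset form, the function names, and the normalization choices are illustrative, not the paper's exact model.

import numpy as np

def l2_normalize(v, eps=1e-8):
    # Project onto the unit sphere so offsets in the shared space are comparable.
    return v / (np.linalg.norm(v) + eps)

def transfer_transformation(img_emb, src_text_emb, tgt_text_emb):
    # The edit direction (e.g. "beach" -> "street") is estimated purely from
    # text embeddings, so no image-to-image training pairs are required.
    t = l2_normalize(tgt_text_emb) - l2_normalize(src_text_emb)
    return l2_normalize(img_emb + t)

The edited embedding can then be used to retrieve, or condition the generation of, images showing the desired modification.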

Citations of this paper

Embedding Arithmetic for Text-driven Image Transformation
TLDR
The SIMAT dataset is introduced to show that vanilla CLIP multimodal embeddings are not very well suited for text-driven image transformation, but that a simple finetuning on the COCO dataset can bring dramatic improvements.
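As a hedged sketch of the embedding arithmetic described above: assuming an L2-normalized gallery of CLIP image embeddings, the edit vector delta could be the difference of two CLIP text embeddings (for instance embed("street") minus embed("beach")). The function below is illustrative, not SIMAT's evaluation code.

import numpy as np

def retrieve_transformed(query_emb, delta, gallery, k=5):
    # Add the text-space edit to the image embedding, renormalize,
    # then rank the gallery by cosine similarity.
    q = query_emb + delta
    q = q / np.linalg.norm(q)
    scores = gallery @ q
    return np.argsort(-scores)[:k]  # indices of the top-k matches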
Learning Joint Visual Semantic Matching Embeddings for Language-Guided Retrieval
TLDR
A unified Joint Visual Semantic Matching model is proposed that learns image-text compositional embeddings by jointly associating visual and textual modalities in a shared discriminative embedding space via compositional losses.

References

SHOWING 1-10 OF 37 REFERENCES
Composing Text and Image for Image Retrieval - an Empirical Odyssey
  • Nam S. Vo, Lu Jiang, James Hays
  • Computer Science
    2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
  • 2019
TLDR
This paper proposes a new way to combine image and text through a residual connection that outperforms existing approaches on three different datasets: Fashion-200k, MIT-States, and a new synthetic dataset the authors create based on CLEVR.
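The gated residual combination can be sketched as follows; the random matrices stand in for the small trainable networks of the actual model, so treat this as an illustration of the composition rule rather than the paper's architecture.

import numpy as np

rng = np.random.default_rng(0)
d = 512  # embedding dimension (illustrative)
W_gate = rng.normal(scale=0.01, size=(d, 2 * d))  # placeholder for a learned gate network
W_res = rng.normal(scale=0.01, size=(d, 2 * d))   # placeholder for a learned residual network

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def compose(img_feat, txt_feat, w_g=1.0, w_r=0.1):
    # The gate preserves the query image feature; the residual injects
    # the modification described by the text.
    joint = np.concatenate([img_feat, txt_feat])
    gate = sigmoid(W_gate @ joint)
    residual = W_res @ joint
    return w_g * gate * img_feat + w_r * residual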
Spatial-Semantic Image Search by Visual Feature Synthesis
TLDR
A spatial-semantic image search technology that enables users to search for images with both semantic and spatial constraints by manipulating concept text-boxes on a 2D query canvas; a convolutional neural network is trained to synthesize visual features that capture the spatial-semantic constraints of the user's canvas query.
DeViSE: A Deep Visual-Semantic Embedding Model
TLDR
This paper presents a new deep visual-semantic embedding model trained to identify visual objects using both labeled image data and semantic information gleaned from unannotated text, and shows that the semantic information can be exploited to make predictions about tens of thousands of image labels not observed during training.
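DeViSE trains its visual-to-semantic mapping with a hinge rank loss; a compact sketch, with M standing in for the learned linear transformation and label_vecs for pre-trained word vectors:

import numpy as np

def devise_hinge_loss(img_feat, label_vecs, true_idx, M, margin=0.1):
    # Map the image feature into the word-vector space, then require the
    # true label's similarity to beat every other label's by the margin.
    proj = M @ img_feat
    scores = label_vecs @ proj
    losses = np.maximum(0.0, margin - scores[true_idx] + scores)
    losses[true_idx] = 0.0  # the true label incurs no loss against itself
    return losses.sum()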
Memory-Augmented Attribute Manipulation Networks for Interactive Fashion Search
TLDR
A novel memory-augmented Attribute Manipulation Network (AMNet) is proposed, which can manipulate image representations at the attribute level and achieves remarkably good performance compared with well-designed baselines in terms of attribute-manipulation effectiveness and search accuracy.
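The memory-based editing intuition can be sketched as attention over attribute prototypes; the actual AMNet uses a learned memory block inside the network, so the blending below is only the rough idea, with all names hypothetical.

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def memory_edit(img_feat, mem_keys, mem_vals, attr_query, alpha=0.5):
    # Attend over stored attribute prototypes with the desired attribute
    # as the query, then blend the read vector into the representation.
    weights = softmax(mem_keys @ attr_query)
    read = weights @ mem_vals
    return (1 - alpha) * img_feat + alpha * read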
Learning Deep Representations of Fine-Grained Visual Descriptions
TLDR
This model achieves strong performance on zero-shot text-based image retrieval and significantly outperforms the attribute-based state-of-the-art for zero-shot classification on the Caltech-UCSD Birds 200-2011 dataset.
Image Generation from Scene Graphs
TLDR
This work proposes a method for generating images from scene graphs, enabling explicit reasoning about objects and their relationships, and validates this approach on Visual Genome and COCO-Stuff.
WhittleSearch: Image search with relative attribute feedback
TLDR
A novel mode of feedback for image search in which a user describes how properties of exemplar images should be adjusted to more closely match his or her mental model of the images sought; this feedback outperforms traditional binary relevance feedback in terms of search speed and accuracy.
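The whittling step itself reduces to filtering candidates by predicted attribute strengths; a minimal sketch, where the score matrix and the feedback format are assumptions for illustration:

import numpy as np

def whittle(scores, feedback):
    # scores: (n_images, n_attrs) predicted relative-attribute strengths.
    # feedback: list of (attr_idx, ref_image_idx, "more" | "less"),
    # e.g. "more formal than reference image 7".
    keep = np.ones(scores.shape[0], dtype=bool)
    for attr, ref, direction in feedback:
        if direction == "more":
            keep &= scores[:, attr] > scores[ref, attr]
        else:
            keep &= scores[:, attr] < scores[ref, attr]
    return np.flatnonzero(keep)  # surviving candidate images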
Learning Attribute Representations with Localization for Flexible Fashion Search
TLDR
The FashionSearchNet is proposed, which uses a weakly supervised localization method to extract attribute regions so that irrelevant features can be ignored, thus improving the similarity learning; it outperforms the most recent fashion search techniques.
Dialog-based Interactive Image Retrieval
TLDR
A new approach to interactive image search that enables users to provide feedback via natural language, allowing for more natural and effective interaction, and achieves better accuracy than other supervised and reinforcement learning baselines.
Label-Embedding for Attribute-Based Classification
TLDR
This work proposes to view attribute-based image classification as a label-embedding problem: each class is embedded in the space of attribute vectors, and introduces a function which measures the compatibility between an image and a label embedding.
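The compatibility function is bilinear, F(x, y) = θ(x)ᵀ W φ(y); a direct sketch, with W the learned map and phi_labels holding one attribute vector per class:

import numpy as np

def compatibility_scores(theta_x, W, phi_labels):
    # theta_x: (d_img,) image feature; W: (d_img, d_attr) learned bilinear map;
    # phi_labels: (n_classes, d_attr) class attribute embeddings.
    # The argmax over the returned scores also covers classes unseen in training.
    return phi_labels @ (W.T @ theta_x)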
...