Integrating Language Guidance into Vision-based Deep Metric Learning

@article{Roth2022IntegratingLG,
  title   = {Integrating Language Guidance into Vision-based Deep Metric Learning},
  author  = {Karsten Roth and Oriol Vinyals and Zeynep Akata},
  journal = {arXiv},
  volume  = {abs/2203.08543},
  year    = {2022}
}
Deep Metric Learning (DML) proposes to learn metric spaces which encode semantic similarities as embedding space distances. These spaces should be transferable to classes beyond those seen during training. Commonly, DML methods task networks to solve contrastive ranking tasks defined over binary class assignments. However, such approaches ignore higher-level semantic relations between the actual classes. This causes learned embedding spaces to encode incomplete semantic context and misrepresent… 
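The contrastive ranking objective over binary class assignments that the abstract describes can be sketched as a standard triplet-margin loss: same-class pairs are pulled together and different-class pairs pushed apart in embedding space. This is a minimal illustrative sketch, not the paper's language-guided method; the function name and margin value are assumptions.

```python
import numpy as np

def triplet_margin_loss(anchor, positive, negative, margin=0.2):
    """Toy contrastive ranking loss over binary class assignments:
    pull the same-class pair (anchor, positive) together and push the
    different-class pair (anchor, negative) apart by at least `margin`.
    Illustrative sketch only; not the paper's exact formulation."""
    d_pos = np.linalg.norm(anchor - positive)  # same-class distance
    d_neg = np.linalg.norm(anchor - negative)  # different-class distance
    return max(0.0, d_pos - d_neg + margin)

# Usage with unit-norm embeddings (hypothetical values):
a = np.array([1.0, 0.0])
p = np.array([0.9, 0.1]); p /= np.linalg.norm(p)
n = np.array([0.0, 1.0])
loss = triplet_margin_loss(a, p, n)
```

Note that the loss depends only on the binary same-class/different-class split, which is exactly the limitation the abstract points to: higher-level semantic relations between classes never enter the objective.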


A Non-isotropic Probabilistic Take on Proxy-based Deep Metric Learning
TLDR
This work introduces non-isotropic probabilistic proxy-based DML, which derives non-isotropic von Mises-Fisher distributions for class proxies to better represent complex class-specific variances and measures the proxy-to-image distance between these models.

References

Showing 1-10 of 127 references
Sharing Matters for Generalization in Deep Metric Learning
TLDR
Experiments show that, independent of the underlying network architecture and the specific ranking loss, the approach significantly improves performance in deep metric learning, leading to new state-of-the-art results on various standard benchmark datasets.
Learning with Memory-based Virtual Classes for Deep Metric Learning
TLDR
This work presents a novel training strategy for DML called MemVir, which embeds the idea of curriculum learning by slowly adding virtual classes for a gradual increase in learning difficulty, which improves the learning stability as well as the final performance.
Deep Compositional Metric Learning
TLDR
A deep compositional metric learning (DCML) framework is proposed for effective and generalizable similarity measurement between images: a set of learnable compositors combines the sub-embeddings and is trained with a self-reinforced loss, with the compositors serving as relays that distribute diverse training signals without destroying discrimination ability.
Attention-based Ensemble for Deep Metric Learning
TLDR
An attention-based ensemble that uses multiple attention masks so that each learner can attend to different parts of the object; it outperforms the state-of-the-art methods by a significant margin on image retrieval tasks.
DiVA: Diverse Visual Feature Aggregation for Deep Metric Learning
TLDR
This work proposes and studies multiple complementary learning tasks, targeting conceptually different data relationships by only resorting to the available training samples and labels of a standard DML setting, resulting in strong generalization and state-of-the-art performance on multiple established DML benchmark datasets.
Making Classification Competitive for Deep Metric Learning
TLDR
It is demonstrated that a standard classification network can be transformed into a variant of proxy-based metric learning that is competitive against non-parametric approaches across a wide variety of image retrieval tasks and can learn high-dimensional binary embeddings that achieve new state-of-the-art performance.
Revisiting Training Strategies and Generalization Performance in Deep Metric Learning
TLDR
A simple, yet effective, training regularization is proposed to reliably boost the performance of ranking-based DML models on various standard benchmark datasets.
DarkRank: Accelerating Deep Metric Learning via Cross Sample Similarities Transfer
TLDR
This work introduces a new type of knowledge, cross-sample similarities, for model compression and acceleration that can be naturally derived from a deep metric learning model, and brings the "learning to rank" technique into the deep metric learning formulation.
Distilling Audio-Visual Knowledge by Compositional Contrastive Learning
TLDR
The main idea is to learn a compositional embedding that closes the cross-modal semantic gap and captures the task-relevant semantics, which facilitates pulling together representations across modalities by compositional contrastive learning.
DeViSE: A Deep Visual-Semantic Embedding Model
TLDR
This paper presents a new deep visual-semantic embedding model trained to identify visual objects using labeled image data as well as semantic information gleaned from unannotated text, and shows that the semantic information can be exploited to make predictions about tens of thousands of image labels not observed during training.