Multimodal Metric Learning for Tag-Based Music Retrieval

@article{Won2021MultimodalML,
  title={Multimodal Metric Learning for Tag-Based Music Retrieval},
  author={Minz Won and Sergio Oramas and Oriol Nieto and Fabien Gouyon and Xavier Serra},
  journal={ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year={2021},
  pages={591-595}
}
  • Minz Won, Sergio Oramas, Oriol Nieto, Fabien Gouyon, Xavier Serra
  • Published 30 October 2020
  • Computer Science
  • ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Tag-based music retrieval is crucial to browse large-scale music libraries efficiently. Hence, automatic music tagging has been actively explored, mostly as a classification task, which has an inherent limitation: a fixed vocabulary. On the other hand, metric learning enables flexible vocabularies by using pretrained word embeddings as side information. Also, metric learning has proven its suitability for cross-modal retrieval tasks in other domains (e.g., text-to-image) by jointly learning a…
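
To make the setup concrete, below is a minimal PyTorch-style sketch of tag-to-audio metric learning in the spirit of the abstract: a tag is represented by a pretrained word vector (e.g., GloVe), an audio clip by a small CNN over a mel-spectrogram, and both are pulled into a shared space with a triplet loss. The module names, dimensions, and architecture are illustrative assumptions, not the authors' implementation.

# Minimal sketch of tag-based music retrieval with metric learning (PyTorch).
# Assumptions: 300-d pretrained word vectors (e.g., GloVe) as tag-side input,
# a toy mel-spectrogram CNN as audio encoder, and a standard triplet margin loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AudioEncoder(nn.Module):
    """Toy CNN mapping a (batch, 1, mels, frames) spectrogram to a 128-d embedding."""
    def __init__(self, emb_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(64, emb_dim)

    def forward(self, spec):
        return F.normalize(self.fc(self.conv(spec).flatten(1)), dim=-1)

class TagEncoder(nn.Module):
    """Projects a pretrained 300-d word vector into the shared embedding space."""
    def __init__(self, word_dim=300, emb_dim=128):
        super().__init__()
        self.fc = nn.Linear(word_dim, emb_dim)

    def forward(self, word_vec):
        return F.normalize(self.fc(word_vec), dim=-1)

audio_enc, tag_enc = AudioEncoder(), TagEncoder()
triplet = nn.TripletMarginLoss(margin=0.2)

spec_pos = torch.randn(8, 1, 96, 256)   # audio clips that match the tags
spec_neg = torch.randn(8, 1, 96, 256)   # audio clips that do not match
tag_vec  = torch.randn(8, 300)          # pretrained word vectors for the tags

loss = triplet(tag_enc(tag_vec), audio_enc(spec_pos), audio_enc(spec_neg))
loss.backward()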

Citations of this paper

Enriched Music Representations With Multiple Cross-Modal Contrastive Learning

TLDR
This paper aligns the latent representations obtained from playlist-track interactions, genre metadata, and the tracks’ audio, by maximizing the agreement between these modality representations using a contrastive loss.
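
A hedged sketch of the kind of cross-modal agreement objective this summary describes, using an InfoNCE-style contrastive loss between two modality embeddings; the batch size, dimensions, and symmetric formulation are assumptions rather than the paper's exact recipe.

# Hedged sketch: contrastive alignment of two modality embeddings (e.g., audio
# vs. a playlist/genre view) with an InfoNCE-style loss; shapes are illustrative.
import torch
import torch.nn.functional as F

def info_nce(z_a, z_b, temperature=0.1):
    """Matching rows of z_a and z_b are positives; all other rows act as negatives."""
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature      # (batch, batch) similarity matrix
    targets = torch.arange(z_a.size(0))       # positives sit on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

audio_emb    = torch.randn(16, 128)   # e.g., from an audio encoder
metadata_emb = torch.randn(16, 128)   # e.g., from playlist or genre metadata
loss = info_nce(audio_emb, metadata_emb)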

Emotion Embedding Spaces for Matching Music to Stories

TLDR
The goal is to help creators find music to match the emotion of their story by leveraging data-driven embeddings on text-based stories that can be auralized, use multiple sentences as input queries, and automatically retrieve matching music.

Contrastive Audio-Language Learning for Music

TLDR
This work proposes MusCALL, a framework for Music Contrastive Audio-Language Learning, a dual-encoder architecture that learns the alignment between pairs of music audio and descriptive sentences, producing multimodal embeddings that can be used for text-to-audio and audio-to-text retrieval out-of-the-box.
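
Once such a dual encoder is trained, text-to-audio retrieval reduces to a cosine-similarity ranking in the shared space. The sketch below illustrates that step only; text_encoder and audio_encoder are placeholders for trained models, not MusCALL's actual API.

# Hedged sketch of dual-encoder retrieval: with a shared embedding space,
# text-to-audio retrieval is a cosine-similarity ranking over candidate tracks.
import torch
import torch.nn.functional as F

def retrieve(text_encoder, audio_encoder, query_text, candidate_audio, top_k=5):
    """query_text encodes to (1, d); candidate_audio encodes to (n, d)."""
    with torch.no_grad():
        q = F.normalize(text_encoder(query_text), dim=-1)
        a = F.normalize(audio_encoder(candidate_audio), dim=-1)
    scores = (q @ a.t()).squeeze(0)              # cosine similarity per candidate
    return torch.topk(scores, k=top_k).indices   # indices of the best-matching tracks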

Exploring modality-agnostic representations for music classification

TLDR
This work explores the use of cross-modal retrieval as a pretext task to learn modality-agnostic representations, which can then be used as inputs to classifiers that are independent of modality.

Learning a Cross-Domain Embedding Space of Vocal and Mixed Audio with a Structure-Preserving Triplet Loss

TLDR
This paper proposes a method to learn a cross-domain embedding space between isolated vocal and mixed audio for vocal-centric MIR tasks, leveraging a pre-trained music source separation model, and significantly improves on the previous cross-domain embedding model.

Semi-supervised Music Tagging Transformer

TLDR
This is the first attempt to utilize the entire audio of the Million Song Dataset; the proposed architecture is shown to outperform previous state-of-the-art music tagging models based on convolutional neural networks under a supervised scheme.

Analysis of Music Retrieval Based on Emotional Tags Environment

  • Nuan Bao
  • Computer Science
    Journal of Environmental and Public Health
  • 2022
TLDR
Experiments demonstrate that the method suggested in this paper can better satisfy user retrieval needs than conventional cosine similarity and tag co-occurrence-based similarity methods and that the fusion of multiple methods is preferable to a single method.
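
A minimal sketch of one way to fuse cosine similarity with tag co-occurrence similarity, as the summary suggests; the equal weighting and the particular co-occurrence normalization are assumptions, not the paper's formula.

# Hedged sketch: fuse cosine similarity and tag co-occurrence similarity with a
# simple weighted sum; alpha and the normalization below are illustrative choices.
import numpy as np

def cosine_sim(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)

def cooccurrence_sim(tag_i, tag_j, counts):
    """counts maps single tags and (tag, tag) pairs to raw occurrence counts."""
    return counts.get((tag_i, tag_j), 0) / max(counts.get(tag_i, 1), counts.get(tag_j, 1))

def fused_score(query_vec, track_vec, query_tag, track_tag, counts, alpha=0.5):
    return alpha * cosine_sim(query_vec, track_vec) + \
           (1 - alpha) * cooccurrence_sim(query_tag, track_tag, counts)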

Learning Unsupervised Hierarchies of Audio Concepts

TLDR
This paper proposes a method to learn numerous music concepts from audio and then automatically hierarchise them to expose their mutual relationships, and shows that the mined hierarchies are aligned with both ground-truth hierarchies of concepts – when available – and with proxy sources of concept similarity in the general case.
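
A hedged sketch of one generic way to hierarchise learned concept embeddings: agglomerative clustering over pairwise cosine distances. This stands in for whatever procedure the paper actually uses, and the concept names and embeddings below are invented.

# Hedged sketch: expose relationships among learned audio-concept embeddings by
# building an agglomerative-clustering hierarchy over their pairwise distances.
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import pdist

concept_names = ["rock", "metal", "jazz", "bebop", "techno"]   # illustrative concepts
concept_embs  = np.random.randn(5, 128)                        # stand-in learned embeddings

Z = linkage(pdist(concept_embs, metric="cosine"), method="average")
tree = dendrogram(Z, labels=concept_names, no_plot=True)       # hierarchy as a data structure
print(tree["ivl"])                                             # leaf order in the mined hierarchy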

References

Showing 1-10 of 22 references

Zero-shot Learning for Audio-based Music Classification and Tagging

TLDR
This work investigates zero-shot learning in the music domain and organizes two different setups of side information: human-labeled attribute information based on the Free Music Archive and OpenMIC-2018 datasets, and general word semantic information from Million Song Dataset and Last.fm tag annotations.
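
A minimal sketch of the zero-shot scoring step such a setup enables: audio is projected into the word-embedding space, and any tag, seen or unseen, is scored by similarity to its pretrained word vector. The shapes and the projection itself are assumed, not taken from the paper.

# Hedged sketch of zero-shot tagging: map audio into the word-vector space, then
# score any tag (including tags never seen in training) against its word embedding.
import torch
import torch.nn.functional as F

def zero_shot_scores(audio_emb, tag_word_vecs):
    """audio_emb: (n, d) audio projected into word space; tag_word_vecs: (t, d)."""
    a = F.normalize(audio_emb, dim=-1)
    w = F.normalize(tag_word_vecs, dim=-1)
    return a @ w.t()    # (n, t) similarities; argmax over t gives the predicted tag

audio_emb   = torch.randn(4, 300)    # outputs of an audio-to-word-space projector
unseen_tags = torch.randn(10, 300)   # word vectors of tags absent from training
pred = zero_shot_scores(audio_emb, unseen_tags).argmax(dim=1)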

Cross-modal Embeddings for Video and Audio Retrieval

TLDR
This work is able to create links between audio and visual documents by projecting them into a common region of the feature space, obtaining joint audio-visual embeddings that are used to retrieve audio samples that fit well to a given silent video, and also to retrieve images that match a given query audio.

Evaluation of CNN-based Automatic Music Tagging Models

TLDR
A consistent evaluation of different music tagging models on three datasets is conducted and reference results using common evaluation metrics are provided and all the models are evaluated with perturbed inputs to investigate the generalization capabilities concerning time stretch, pitch shift, dynamic range compression, and addition of white noise.
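
A hedged sketch of the kind of input perturbations mentioned, applied to a raw waveform before re-evaluating a tagging model; librosa is assumed for time stretch and pitch shift, and the noise level and shift amounts are arbitrary choices, not the paper's settings.

# Hedged sketch of robustness checks: perturb the waveform (white noise, time
# stretch, pitch shift) and re-run the tagging model on each perturbed version.
import numpy as np
import librosa

def perturbations(y, sr):
    yield "white_noise",  y + 0.005 * np.random.randn(len(y)).astype(y.dtype)
    yield "time_stretch", librosa.effects.time_stretch(y, rate=0.9)
    yield "pitch_shift",  librosa.effects.pitch_shift(y, sr=sr, n_steps=2)

# for name, y_pert in perturbations(waveform, sample_rate):
#     score = evaluate_tagger(model, y_pert)   # hypothetical evaluation helper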

Multimodal Deep Learning for Music Genre Classification

TLDR
An approach to learn and combine multimodal data representations for music genre classification is proposed, and a method for dimensionality reduction of the target labels yields major improvements in multi-label classification.

Mood Classification Using Listening Data

TLDR
It is shown that listening-based features outperform content-based ones when classifying moods: embeddings obtained through matrix factorization of listening data appear to be more informative of a track's mood than embeddings based on its audio content.
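
A minimal sketch of the listening-data pipeline this describes: factorize a user-by-track play-count matrix to obtain track embeddings, then train a mood classifier on them. Truncated SVD and logistic regression stand in for whatever factorization and classifier the paper uses, and all data here is synthetic.

# Hedged sketch: track embeddings from factorizing a (user x track) listening
# matrix, followed by a simple mood classifier on top of those embeddings.
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression

listening  = np.random.poisson(0.2, size=(1000, 200))                   # toy play counts
track_embs = TruncatedSVD(n_components=32).fit_transform(listening.T)   # (tracks, 32)

mood_labels = np.random.randint(0, 4, size=200)                         # toy mood classes
clf = LogisticRegression(max_iter=1000).fit(track_embs, mood_labels)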

Learning Deep Structure-Preserving Image-Text Embeddings

This paper proposes a method for learning joint embeddings of images and text using a two-branch neural network with multiple layers of linear projections followed by nonlinearities. The network is…

DeViSE: A Deep Visual-Semantic Embedding Model

TLDR
This paper presents a new deep visual-semantic embedding model trained to identify visual objects using both labeled image data and semantic information gleaned from unannotated text, and shows that the semantic information can be exploited to make predictions about tens of thousands of image labels not observed during training.
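
A hedged sketch of the DeViSE idea as summarized: visual features are projected into a word-embedding space and trained with a hinge ranking loss that prefers the correct label's word vector over others. The dimensions and margin are illustrative, and the loss is a simplified rendering of the paper's objective.

# Hedged sketch of a DeViSE-style objective: project visual features into a word
# embedding space and apply a hinge ranking loss against incorrect label vectors.
import torch
import torch.nn.functional as F

def devise_loss(visual_feat, proj, label_vec, other_vecs, margin=0.1):
    """visual_feat: (d_v,), proj: nn.Linear(d_v, d_w), label_vec: (d_w,), other_vecs: (k, d_w)."""
    v = F.normalize(proj(visual_feat), dim=-1)
    pos = v @ F.normalize(label_vec, dim=-1)            # similarity to the true label
    neg = F.normalize(other_vecs, dim=-1) @ v            # similarities to wrong labels
    return torch.clamp(margin - pos + neg, min=0).sum()

proj = torch.nn.Linear(2048, 300)                        # e.g., CNN feature -> word space
loss = devise_loss(torch.randn(2048), proj, torch.randn(300), torch.randn(5, 300))
loss.backward()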

GloVe: Global Vectors for Word Representation

TLDR
A new global log-bilinear regression model that combines the advantages of the two major model families in the literature, global matrix factorization and local context window methods, and produces a vector space with meaningful substructure.
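
For reference, a small numpy sketch of the GloVe-style weighted least-squares objective the summary alludes to: word and context vectors plus biases are fit to log co-occurrence counts, with a weighting function that damps rare and very frequent pairs. Symbols follow the standard GloVe formulation; the hyperparameter values are the commonly cited defaults, assumed here.

# Hedged sketch of the GloVe objective: weighted least squares fitting
# w_i . w~_j + b_i + b~_j to log X_ij, with weight f(X) = min((X / x_max)^alpha, 1).
import numpy as np

def glove_loss(W, W_tilde, b, b_tilde, X, x_max=100.0, alpha=0.75):
    """W, W_tilde: (V, d) word/context vectors; b, b_tilde: (V,) biases; X: (V, V) co-occurrence counts."""
    i, j = np.nonzero(X)                                    # only observed co-occurrences
    pred = np.sum(W[i] * W_tilde[j], axis=1) + b[i] + b_tilde[j]
    f = np.minimum((X[i, j] / x_max) ** alpha, 1.0)         # down-weight rare and very frequent pairs
    return np.sum(f * (pred - np.log(X[i, j])) ** 2)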

Deep Metric Learning Using Triplet Network

TLDR
This paper proposes the triplet network model, which aims to learn useful representations by distance comparisons, and demonstrates using various datasets that this model learns a better representation than that of its immediate competitor, the Siamese network.

Sampling Matters in Deep Embedding Learning

TLDR
This paper proposes distance-weighted sampling, which selects more informative and stable examples than traditional approaches, and shows that a simple margin-based loss is sufficient to outperform all other loss functions.
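
A hedged sketch of the two ingredients named here: negatives drawn with probability inversely proportional to the analytic density of pairwise distances on the unit sphere, followed by a simple margin-based loss. This is a simplified reading of the method, with the clipping and hyperparameters chosen purely for illustration.

# Hedged sketch of distance-weighted negative sampling plus a margin-based loss.
import torch
import torch.nn.functional as F

def sample_negative(anchor, candidates, dim=128, cutoff=0.5):
    """anchor: (d,), candidates: (n, d); both assumed L2-normalized (unit sphere)."""
    d = torch.clamp(torch.norm(candidates - anchor, dim=1), min=cutoff, max=1.99)
    # analytic log-density of pairwise distances on the unit sphere (up to a constant)
    log_q = (dim - 2) * torch.log(d) + ((dim - 3) / 2) * torch.log(1 - d ** 2 / 4)
    weights = torch.exp(torch.clamp(-log_q, max=10))   # inverse-density weights, clipped for stability
    weights = weights / weights.sum()
    return candidates[torch.multinomial(weights, 1).item()]

def margin_loss(anchor, pos, neg, margin=0.2, beta=1.2):
    d_pos = torch.norm(anchor - pos)
    d_neg = torch.norm(anchor - neg)
    return F.relu(d_pos - beta + margin) + F.relu(beta - d_neg + margin)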