Learning music audio representations via weak language supervision

Ilaria Manco, Emmanouil Benetos, Elio Quinton, György Fazekas
Audio representations for music information retrieval are typically learned via supervised learning in a task-specific fashion. Although effective at producing state-of-the-art results, this scheme lacks flexibility with respect to the range of applications a model can have and requires extensively annotated datasets. In this work, we pose the question of whether it may be possible to exploit weakly aligned text as the only supervisory signal to learn general-purpose music audio…

COALA: Co-Aligned Autoencoders for Learning Semantically Enriched Audio Representations
The results are promising, sometimes on par with the state of the art in the considered tasks, and the embeddings produced with the method are well correlated with some acoustic descriptors.
Learning Contextual Tag Embeddings for Cross-Modal Alignment of Audio and Tags
The results show that employing self-attention with multiple heads in the tag-based network can induce better learned audio representations.
Pre-Training Audio Representations With Self-Supervision
This work proposes two self-supervised tasks: Audio2Vec, which aims at reconstructing a spectrogram slice from past and future slices, and TemporalGap, which estimates the distance between two short audio segments extracted at random from the same audio clip.
Contrastive Learning of Musical Representations
It is shown that CLMR’s representations are transferable using out-of-domain datasets, indicating that the method has strong generalisability in music classification. To foster reproducibility and future research on self-supervised learning in music, the models and source code are publicly released.
Transfer Learning in MIR: Sharing Learned Latent Representations for Music Audio Classification and Similarity
The results show that shared representations can improve classification accuracy and that transfer learning can improve performance for music similarity.
Transfer Learning by Supervised Pre-training for Audio-based Music Classification
It is shown that features learned from MSD audio fragments in a supervised manner, using tag labels and user listening data, consistently outperform features learned in an unsupervised manner in this setting, provided that the learned feature extractor is of limited complexity.
Codified audio language modeling learns useful representations for music information retrieval
The strength of Jukebox’s representations is interpreted as evidence that modeling audio instead of tags provides richer representations for MIR.
Enriched Music Representations With Multiple Cross-Modal Contrastive Learning
This paper aligns the latent representations obtained from playlist-track interactions, genre metadata, and the tracks’ audio by maximizing the agreement between these modality representations using a contrastive loss.
Representation Learning of Music Using Artist Labels
This paper presents a feature learning approach that uses the artist labels attached to every music track as objective metadata and trains a deep convolutional neural network to classify audio tracks into a large number of artists.
FMA: A Dataset for Music Analysis
The Free Music Archive is introduced, an open and easily accessible dataset suitable for evaluating several tasks in MIR, a field concerned with browsing, searching, and organizing large music collections, and some suitable MIR tasks are discussed.