• Corpus ID: 244954710

Learning music audio representations via weak language supervision

@article{Manco2021LearningMA,
  title={Learning music audio representations via weak language supervision},
  author={Ilaria Manco and Emmanouil Benetos and Elio Quinton and Gy{\"o}rgy Fazekas},
  journal={ArXiv},
  year={2021},
  volume={abs/2112.04214}
}
Audio representations for music information retrieval are typically learned via supervised learning in a task-specific fashion. Although effective at producing state-of-the-art results, this scheme lacks flexibility with respect to the range of applications a model can have and requires extensively annotated datasets. In this work, we pose the question of whether it may be possible to exploit weakly aligned text as the only supervisory signal to learn general-purpose music audio representations… 

References

Learning Contextual Tag Embeddings for Cross-Modal Alignment of Audio and Tags
TLDR
The results show that employing multi-head self-attention with multiple heads in the tag-based network can induce better learned audio representations.
COALA: Co-Aligned Autoencoders for Learning Semantically Enriched Audio Representations
TLDR
The results are promising, sometimes on par with the state-of-the-art in the considered tasks, and the embeddings produced with the method are well correlated with some acoustic descriptors.
Pre-Training Audio Representations With Self-Supervision
TLDR
This work proposes two self-supervised tasks: Audio2Vec, which aims at reconstructing a spectrogram slice from past and future slices and TemporalGap, which estimates the distance between two short audio segments extracted at random from the same audio clip.
Transfer Learning by Supervised Pre-training for Audio-based Music Classification
TLDR
It is shown that features learned from MSD audio fragments in a supervised manner, using tag labels and user listening data, consistently outperform features learned in an unsupervised manner in this setting, provided that the learned feature extractor is of limited complexity.
Transfer Learning In Mir: Sharing Learned Latent Representations For Music Audio Classification And Similarity
TLDR
The results show that shared representations can improve classification accuracy and that transfer learning can improve performance for music similarity.
Codified audio language modeling learns useful representations for music information retrieval
TLDR
The strength of Jukebox’s representations is interpreted as evidence that modeling audio instead of tags provides richer representations for MIR.
Contrastive Learning of Musical Representations
TLDR
This work introduces SimCLR to the music domain and contributes a large chain of audio data augmentations to form CLMR, a simple framework for self-supervised learning from raw waveforms of music. Its representations are shown to be transferable using out-of-domain datasets, indicating that they capture important musical knowledge.
Enriched Music Representations With Multiple Cross-Modal Contrastive Learning
TLDR
This paper aligns the latent representations obtained from playlists-track interactions, genre metadata, and the tracks’ audio, by maximizing the agreement between these modality representations using a contrastive loss.
Representation Learning of Music Using Artist Labels
TLDR
This paper presents a feature learning approach that uses the artist labels attached to every music track as objective metadata and trains a deep convolutional neural network to classify audio tracks into a large number of artists.
Multi-Task Self-Supervised Pre-Training for Music Classification
  • Ho-Hsiang Wu, Chieh-Chi Kao, Chao Wang
  • Computer Science
    ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2021
TLDR
This paper applies self-supervised and multi-task learning methods for pre-training music encoders, and explores various design choices including encoder architectures, weighting mechanisms to combine losses from multiple tasks, and worker selections of pretext tasks to investigate how these design choices interact with various downstream music classification tasks.