Corpus ID: 235828808

Codified audio language modeling learns useful representations for music information retrieval

Rodrigo Castellon, Chris Donahue, Percy Liang
We demonstrate that language models pre-trained on codified (discretely-encoded) music audio learn representations that are useful for downstream MIR tasks. Specifically, we explore representations from Jukebox [1]: a music generation system containing a language model trained on codified audio from 1M songs. To determine if Jukebox's representations contain useful information for MIR, we use them as input features to train shallow models on several MIR tasks. Relative to representations from…
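The probing setup described in the abstract — frozen representations from a pre-trained model fed to a shallow downstream model — can be sketched as follows. This is a minimal illustrative example, not the paper's actual pipeline: the random arrays stand in for extracted Jukebox activations and task labels, and the shapes are assumptions.

```python
import numpy as np

# Stand-ins for frozen features from a pre-trained model (one feature
# vector per track) and 10-way task labels (e.g. genre classification).
# All shapes and values here are illustrative assumptions.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 64))      # 200 tracks, 64-d features
y = rng.integers(0, 10, size=200)       # 10 classes

# "Shallow model": a linear probe fit by least squares on one-hot targets.
Y = np.eye(10)[y]                           # (200, 10) one-hot labels
W, *_ = np.linalg.lstsq(X, Y, rcond=None)   # (64, 10) probe weights
pred = (X @ W).argmax(axis=1)
train_acc = (pred == y).mean()
```

The point of the probe is that only `W` is trained; any gain over a baseline must come from information already present in the frozen features.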


Contrastive Learning of Musical Representations
This work introduces SimCLR to the music domain, contributing a chain of audio data augmentations to form a simple framework for self-supervised learning on raw music waveforms: CLMR. Its representations transfer to out-of-domain datasets, indicating that they capture important musical knowledge.
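A hedged sketch of the contrastive objective behind SimCLR-style methods such as CLMR: the NT-Xent loss pulls together embeddings of two augmented views of the same clip and pushes apart all other pairs in the batch. The embeddings, batch size, and temperature below are illustrative assumptions, not CLMR's actual configuration.

```python
import numpy as np

def nt_xent(z1, z2, temperature=0.5):
    """NT-Xent loss over a batch of embedding pairs (z1[i], z2[i])
    taken from two augmented views of the same audio clip."""
    n = z1.shape[0]
    z = np.concatenate([z1, z2], axis=0)               # (2n, d)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # unit-normalize
    sim = z @ z.T / temperature                        # cosine sims
    np.fill_diagonal(sim, -np.inf)                     # drop self-pairs
    # The positive for row i is its counterpart in the other view.
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(n)])
    logsumexp = np.log(np.exp(sim).sum(axis=1))
    loss = -(sim[np.arange(2 * n), pos] - logsumexp)
    return loss.mean()

rng = np.random.default_rng(0)
z1 = rng.standard_normal((8, 16))
z2 = z1 + 0.05 * rng.standard_normal((8, 16))  # views of the same clips
loss = nt_xent(z1, z2)
```

Because the two views of each clip agree closely here, this loss is lower than it would be for unrelated embeddings, which is exactly the gradient signal the encoder is trained on.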
musicnn: Pre-trained convolutional neural networks for music audio tagging
The musicnn library contains a set of pre-trained, musically motivated convolutional neural networks for music audio tagging, which can be used as out-of-the-box music audio taggers, as music feature extractors, or as pre-trained models for transfer learning.
Enriched Music Representations With Multiple Cross-Modal Contrastive Learning
This paper aligns the latent representations obtained from playlist-track interactions, genre metadata, and the tracks' audio, maximizing the agreement between these modality representations with a contrastive loss.
Learning Contextual Tag Embeddings for Cross-Modal Alignment of Audio and Tags
The results show that employing multi-head self-attention in the tag-based network can induce better-learned audio representations.
Transfer Learning In Mir: Sharing Learned Latent Representations For Music Audio Classification And Similarity
The results show that shared representations can improve classification accuracy, and that transfer learning can improve performance on music similarity.
Transfer Learning by Supervised Pre-training for Audio-based Music Classification
It is shown that features learned from MSD audio fragments in a supervised manner, using tag labels and user listening data, consistently outperform features learned in an unsupervised manner in this setting, provided that the learned feature extractor is of limited complexity.
Learning Features from Music Audio with Deep Belief Networks
This work presents a system that automatically extracts task-relevant features from audio by training a Deep Belief Network on Discrete Fourier Transforms of the audio, applied to the task of genre recognition.
One deep music representation to rule them all? A comparative analysis of different representation learning strategies
An extensive empirical study involving multiple learning sources, and multiple deep learning architectures with varying levels of information sharing between sources, yields insights into how to design methods for learning widely deployable deep data representations in the music domain.
Language Models are Unsupervised Multitask Learners
It is demonstrated that language models begin to learn language tasks without any explicit supervision when trained on WebText, a new dataset of millions of webpages, suggesting a promising path toward language processing systems that learn to perform tasks from their naturally occurring demonstrations.
Large-Scale Weakly-Supervised Content Embeddings for Music Recommendation and Tagging
A hybrid training scheme uses classification and metric-learning losses to incorporate both metadata-derived text labels and aggregate co-listen supervisory signals into a single convolutional model, achieving state-of-the-art performance on two music tagging benchmarks.