• Corpus ID: 235828808

Codified audio language modeling learns useful representations for music information retrieval

Rodrigo Castellon, Chris Donahue, Percy Liang
We demonstrate that language models pre-trained on codified (discretely-encoded) music audio learn representations that are useful for downstream MIR tasks. Specifically, we explore representations from Jukebox [1]: a music generation system containing a language model trained on codified audio from 1M songs. To determine if Jukebox’s representations contain useful information for MIR, we use them as input features to train shallow models on several MIR tasks. Relative to representations from… 
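The probing setup described above (frozen representations used as input features for a shallow model) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the features are synthetic stand-ins rather than real Jukebox activations, the softmax-regression probe is only one possible shallow model, and the names `train_linear_probe` and `predict` are hypothetical.

```python
import numpy as np


def train_linear_probe(features, labels, num_classes, lr=0.01, epochs=200):
    """Fit a softmax-regression probe on frozen features with gradient descent.

    The features are never updated; only the probe's weights and bias are learned.
    """
    n, d = features.shape
    weights = np.zeros((d, num_classes))
    bias = np.zeros(num_classes)
    one_hot = np.eye(num_classes)[labels]
    for _ in range(epochs):
        logits = features @ weights + bias
        logits -= logits.max(axis=1, keepdims=True)  # for numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad = (probs - one_hot) / n  # gradient of mean cross-entropy w.r.t. logits
        weights -= lr * (features.T @ grad)
        bias -= lr * grad.sum(axis=0)
    return weights, bias


def predict(features, weights, bias):
    """Return the highest-scoring class index for each row of features."""
    return np.argmax(features @ weights + bias, axis=1)


# ASSUMPTION: synthetic clusters stand in for embeddings that would, in
# practice, be extracted once from a frozen pre-trained model.
rng = np.random.default_rng(0)
num_classes, dim = 3, 16
centers = rng.normal(scale=3.0, size=(num_classes, dim))
train_labels = rng.integers(0, num_classes, size=300)
test_labels = rng.integers(0, num_classes, size=100)
train_feats = centers[train_labels] + rng.normal(size=(300, dim))
test_feats = centers[test_labels] + rng.normal(size=(100, dim))

weights, bias = train_linear_probe(train_feats, train_labels, num_classes)
accuracy = float(np.mean(predict(test_feats, weights, bias) == test_labels))
print(f"probe accuracy on synthetic features: {accuracy:.2f}")
```

Keeping the probe shallow means that its accuracy mostly reflects how much task-relevant information the frozen features already contain, rather than what a large downstream model could learn on its own.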


Learning music audio representations via weak language supervision
This work designs a multimodal architecture for music and language pre-training (MuLaP) optimised via a set of proxy tasks and confirms that MuLaP effectively leverages audio-caption pairs to learn representations that are competitive with audio-only and cross-modal self-supervised methods in the literature.
Unsupervised Source Separation By Steering Pretrained Music Models
An unsupervised method is presented that repurposes deep models trained for music generation and music tagging for audio source separation without any retraining, highlighting the vast and heretofore untapped potential of large pretrained music models for audio-to-audio tasks like source separation.
MT3: Multi-Task Multitrack Music Transcription
This work demonstrates that a general-purpose Transformer model can perform multi-task AMT, jointly transcribing arbitrary combinations of musical instruments across several transcription datasets, dramatically improving performance for low-resource instruments while preserving strong performance for abundant instruments.
Use of Speaker Recognition Approaches for Learning and Evaluating Embedding Representations of Musical Instrument Sounds
A musical instrument recognition model that uses a SincNet front-end, a ResNet architecture, and an angular softmax objective function is constructed, which shows that including instrument family labels as a multi-task learning target can help to regularize the embedding space and incorporate useful structure.
Feature-informed Embedding Space Regularization For Audio Classification
The proposed regularization methods not only outperform baseline methods but can also improve state-of-the-art models on several audio classification tasks, and the results suggest that using a mixture of features performs better than using individual features.
S3T: Self-Supervised Pre-training with Swin Transformer for Music Classification
S3T is the first method combining the Swin Transformer with a self-supervised learning method for music classification, aiming to learn meaningful music representations from massive easily accessible unlabeled music data.
Jointist: Joint Learning for Multi-instrument Transcription and Its Applications
Jointist is introduced, an instrument-aware multi-instrument framework that is capable of transcribing, recognizing, and separating multiple musical instruments from an audio clip. It is argued that transcription models can be utilized as a preprocessing module for other music analysis tasks.
Adapting TTS models For New Speakers using Transfer Learning
It is found that fine-tuning a single-speaker TTS model on just 30 minutes of data can yield comparable performance to a model trained from scratch on more than 27 hours of data for both male and female target speakers.
Sheet Sage is presented, a system designed to transcribe Western multitrack music into lead sheets: human-readable scores that indicate melody and harmony. It can transcribe recordings into score representations that echo the musical understanding of human experts.


musicnn: Pre-trained convolutional neural networks for music audio tagging
The musicnn library contains a set of pre-trained musically motivated convolutional neural networks for music audio tagging, which can be used as out-of-the-box music audio taggers, as music feature extractors, or as pre-trained models for transfer learning.
Enriched Music Representations With Multiple Cross-Modal Contrastive Learning
This paper aligns the latent representations obtained from playlists-track interactions, genre metadata, and the tracks’ audio, by maximizing the agreement between these modality representations using a contrastive loss.
Learning Contextual Tag Embeddings for Cross-Modal Alignment of Audio and Tags
The results show that employing multi-head self-attention with multiple heads in the tag-based network can induce better learned audio representations.
Contrastive Learning of Musical Representations
It is shown that CLMR’s representations are transferable using out-of-domain datasets, indicating that the method has strong generalisability in music classification. To foster reproducibility and future research on self-supervised learning in music, the models and source code are publicly released.
Effectiveness of self-supervised pre-training for speech recognition
This work directly fine-tunes the pre-trained BERT models on transcribed speech using a Connectionist Temporal Classification (CTC) loss instead of feeding the representations into a task-specific model, demonstrating that self-supervision can enable speech recognition systems trained on a near-zero amount of transcribed data.
Transfer Learning In Mir: Sharing Learned Latent Representations For Music Audio Classification And Similarity
The results show that shared representations can improve classification accuracy, and the work shows how transfer learning can improve performance for music similarity.
Transfer Learning by Supervised Pre-training for Audio-based Music Classification
It is shown that features learned from MSD audio fragments in a supervised manner, using tag labels and user listening data, consistently outperform features learned in an unsupervised manner in this setting, provided that the learned feature extractor is of limited complexity.
Learning Features from Music Audio with Deep Belief Networks
This work presents a system that can automatically extract relevant features from audio for a given task by using a Deep Belief Network on Discrete Fourier Transforms of the audio to solve the task of genre recognition.
wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations
We show for the first time that learning powerful representations from speech audio alone followed by fine-tuning on transcribed speech can outperform the best semi-supervised methods while being
Language Models are Unsupervised Multitask Learners
It is demonstrated that language models begin to learn these tasks without any explicit supervision when trained on a new dataset of millions of webpages called WebText, suggesting a promising path towards building language processing systems which learn to perform tasks from their naturally occurring demonstrations.