Multi-Task Self-Supervised Pre-Training for Music Classification

  • Ho-Hsiang Wu, Chieh-Chi Kao, Qingming Tang, Ming Sun, Brian McFee, Juan Pablo Bello, Chao Wang
  • Published 5 February 2021
  • Computer Science
  • ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Deep learning is very data hungry, and supervised learning in particular requires massive amounts of labeled data to work well. Machine listening research often suffers from a shortage of labeled data, as human annotations are costly to acquire, and annotating audio is time consuming and less intuitive. Moreover, models learned from a labeled dataset often embed biases specific to that particular dataset. Therefore, unsupervised learning techniques have become popular approaches for solving machine listening…

Pretext Tasks Selection for Multitask Self-Supervised Audio Representation Learning

A method for selecting a group of pretext tasks from a set of candidates is introduced; task groups selected and weighted with this method outperform classic baselines, facilitating the selection and combination of relevant pretext-task labels for self-supervised representation learning.

Learning Music Audio Representations Via Weak Language Supervision

This work designs a multimodal architecture for music and language pre-training (MuLaP) optimised via a set of proxy tasks and confirms that MuLaP effectively leverages audio-caption pairs to learn representations that are competitive with audio-only and cross-modal self-supervised methods in the literature.

Contrastive Learning with Positive-Negative Frame Mask for Music Representation

A novel contrastive learning objective is proposed that accommodates both self-augmented positives and negatives sampled from the same piece of music; experimental results on two music-related downstream tasks, music classification and cover song identification, demonstrate the generalization ability and transferability of the music representations learned by PEMR.
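As a rough illustration of the kind of objective described above (not PEMR's exact formulation), the sketch below computes an InfoNCE-style contrastive loss in which a boolean mask marks which candidate embeddings count as positives for a given anchor; all function and variable names are hypothetical:

```python
import numpy as np

def info_nce(anchor, candidates, positive_mask, temperature=0.1):
    """InfoNCE-style contrastive loss with an explicit positive mask.

    anchor:        (d,) embedding of the query excerpt
    candidates:    (n, d) candidate embeddings, positives and negatives mixed
    positive_mask: (n,) boolean array, True where the candidate is a positive
    """
    # Cosine similarities between the anchor and every candidate.
    a = anchor / np.linalg.norm(anchor)
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    logits = c @ a / temperature

    # Numerically stable log-softmax over all candidates,
    # averaged over the positions the mask marks as positive.
    m = logits.max()
    log_probs = logits - (m + np.log(np.exp(logits - m).sum()))
    return -log_probs[positive_mask].mean()

# Toy usage: one near-duplicate positive, two unrelated negatives.
rng = np.random.default_rng(0)
anchor = rng.normal(size=8)
positive = anchor + 0.01 * rng.normal(size=8)
negatives = rng.normal(size=(2, 8))
candidates = np.vstack([positive[None, :], negatives])
mask = np.array([True, False, False])
loss = info_nce(anchor, candidates, mask)
```

Pulling the positive over the anchor while pushing all other candidates away is the shared core of such objectives; the masking scheme is what each paper varies.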

Self-Supervised Beat Tracking in Musical Signals with Polyphonic Contrastive Learning

This work presents a new self-supervised pretext task for beat tracking and downbeat estimation, and is notably one of the first to use audio source separation as a fundamental component of self-supervision.

MidiBERT-Piano: Large-scale Pre-training for Symbolic Music Understanding

An attempt to employ the masked language modeling approach of BERT to pre-train a 12-layer Transformer model for tackling a number of symbolic-domain discriminative music understanding tasks, finding that, given a pretrained Transformer, the models outperform recurrent neural network baselines with less than 10 epochs of fine-tuning.

Learning Music Representations with wav2vec 2.0

The results show that wav2vec 2.0 pre-trained on music data allows us to achieve promising results on music classification tasks that are competitive with prior work on audio representations.

Sound and Visual Representation Learning with Multiple Pretraining Tasks

The experiments on binaural sound representations demonstrate that Multi-SSL via incremental learning (IL) of SSL tasks outperforms single SSL task models and fully supervised models in the downstream task performance.

Wav2CLIP: Learning Robust Audio Representations from CLIP

Wav2CLIP is proposed, a robust audio representation learning method by distilling from Contrastive Language-Image Pre-training (CLIP), and is more efficient to pre-train than competing methods as it does not require learning a visual model in concert with an auditory model.

Instance Selection for Music Genre Classification using Heterogeneous Networks

This work introduces musical data instance selection into heterogeneous network models, and proposes and evaluates ten different heterogeneous networks to identify more representative relationships among various related musical features, including songs, artists, genres, and mel-spectrograms.

Spectrograms Are Sequences of Patches

This work treats a music spectrogram as a sequence of patches and designs a self-supervised model, Patchifier, that captures the features of these sequential patches and makes good use of self-supervised learning methods from both the NLP and CV domains.
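To make the patch-sequence idea concrete, here is a minimal sketch (not the paper's actual Patchifier code) of slicing a mel spectrogram into non-overlapping time patches and flattening each one into a token vector; the patch size and names are illustrative assumptions:

```python
import numpy as np

def patchify(spec, patch_time=16):
    """Split a (mel_bins, frames) spectrogram into a sequence of
    non-overlapping patches along the time axis, each flattened to a
    vector, as a Transformer-style model would consume them.
    Trailing frames that do not fill a whole patch are dropped."""
    mel_bins, frames = spec.shape
    n = frames // patch_time
    spec = spec[:, : n * patch_time]
    # (mel_bins, n, patch_time) -> (n, mel_bins, patch_time) -> flatten
    patches = spec.reshape(mel_bins, n, patch_time).transpose(1, 0, 2)
    return patches.reshape(n, mel_bins * patch_time)

# A 128-bin mel spectrogram of 100 frames yields 6 patches of 16 frames each.
spec = np.random.rand(128, 100)
seq = patchify(spec)
```

Each row of `seq` then plays the role a word embedding plays in NLP or an image patch plays in a vision Transformer, which is what lets methods from both domains transfer.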



Learning Problem-agnostic Speech Representations from Multiple Self-supervised Tasks

Experiments show that the proposed improved self-supervised method can learn transferable, robust, and problem-agnostic features that carry relevant information from the speech signal, such as speaker identity, phonemes, and even higher-level features such as emotional cues.

Multi-Task Self-Supervised Learning for Robust Speech Recognition

PASE+ is proposed, an improved version of PASE that better learns short- and long-term speech dynamics with an efficient combination of recurrent and convolutional networks and learns transferable representations suitable for highly mismatched acoustic conditions.

Unsupervised Learning of Local Features for Music Classification

It is shown that convolutional extraction of local feature responses is crucial for reaching high performance, and that simple, fast learning techniques such as k-means or randomly selected features are competitive with previously published results that also learn features from audio signals.

Look, Listen, and Learn More: Design Choices for Deep Audio Embeddings

This paper investigates how L3-Net design choices impact the performance of downstream audio classifiers trained with these embeddings, and shows that audio-informed choices of input representation are important, and that using sufficient data for training the embedding is key.

Multitask Learning for Frame-level Instrument Recognition

A large-scale dataset containing synthetic polyphonic music with frame-level pitch and instrument labels is presented, and a simple yet novel network architecture is proposed to jointly predict the pitch and instrument for each frame; the effectiveness of the proposed method is validated.

Big Self-Supervised Models are Strong Semi-Supervised Learners

The proposed semi-supervised learning algorithm can be summarized in three steps: unsupervised pre-training of a big ResNet model using SimCLRv2 (a modification of SimCLR), supervised fine-tuning on a few labeled examples, and distillation with unlabeled examples for refining and transferring the task-specific knowledge.
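The distillation step of the three-step recipe above can be sketched as a soft-label cross-entropy between the teacher's and student's temperature-softened predictions; this is a generic distillation loss under stated assumptions, not SimCLRv2's exact training code:

```python
import numpy as np

def softmax(logits, temperature=1.0):
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """Cross-entropy between the teacher's softened class distribution
    and the student's prediction, averaged over the (unlabeled) batch."""
    p_teacher = softmax(teacher_logits, temperature)
    log_p_student = np.log(softmax(student_logits, temperature))
    return -(p_teacher * log_p_student).sum(axis=-1).mean()

# Toy batch: 4 unlabeled examples, 10 classes.
rng = np.random.default_rng(1)
teacher_logits = rng.normal(size=(4, 10))
student_logits = rng.normal(size=(4, 10))
loss = distillation_loss(teacher_logits, student_logits)
```

Because the targets come from the fine-tuned teacher rather than from human labels, this final step can consume arbitrary amounts of unlabeled audio, which is what makes the recipe semi-supervised.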

FMA: A Dataset for Music Analysis

The Free Music Archive is introduced, an open and easily accessible dataset suitable for evaluating several tasks in MIR, a field concerned with browsing, searching, and organizing large music collections, and some suitable MIR tasks are discussed.

OpenMIC-2018: An Open Data-set for Multiple Instrument Recognition

The construction of a new, open data-set for multi-instrument recognition, which contains 20,000 examples of Creative Commons-licensed music available on the Free Music Archive, and how the instrument taxonomy was constructed is described.

One deep music representation to rule them all? A comparative analysis of different representation learning strategies

This work conducts an extensive empirical study involving multiple learning sources, as well as multiple deep learning architectures with varying levels of information sharing between sources, in order to learn music representations; it yields insights into how to approach the design of methods for learning widely deployable deep data representations in the music domain.

Transfer Learning for Music Classification and Regression Tasks

This paper proposes to use a pre-trained convnet feature, a concatenated feature vector using the activations of feature maps of multiple layers in a trained convolutional network, and shows how it can serve as general-purpose music representation.