Multi-Task Self-Supervised Pre-Training for Music Classification
@inproceedings{Wu2021MultiTaskSP,
  title     = {Multi-Task Self-Supervised Pre-Training for Music Classification},
  author    = {Ho-Hsiang Wu and Chieh-Chi Kao and Qingming Tang and Ming Sun and Brian McFee and Juan Pablo Bello and Chao Wang},
  booktitle = {ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year      = {2021},
  pages     = {556-560}
}
Deep learning is data hungry, and supervised learning in particular requires massive labeled data to work well. Machine listening research often suffers from limited labeled data, as human annotations are costly to acquire, and annotating audio is time consuming and less intuitive. Moreover, models learned from a labeled dataset often embed biases specific to that particular dataset. Unsupervised learning techniques have therefore become popular approaches for machine listening…
13 Citations
Pretext Tasks Selection for Multitask Self-Supervised Audio Representation Learning
- Computer Science
- IEEE Journal of Selected Topics in Signal Processing
- 2022
A method for selecting a group of pretext tasks among a set of candidates is introduced; the groups selected and weighted with this method outperform classic baselines, facilitating the selection and combination of relevant pretext-task labels for self-supervised representation learning.
Learning Music Audio Representations Via Weak Language Supervision
- Computer Science
- ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2022
This work designs a multimodal architecture for music and language pre-training (MuLaP) optimised via a set of proxy tasks and confirms that MuLaP effectively leverages audio-caption pairs to learn representations that are competitive with audio-only and cross-modal self-supervised methods in the literature.
Contrastive Learning with Positive-Negative Frame Mask for Music Representation
- Computer Science
- WWW
- 2022
A novel contrastive learning objective is proposed to accommodate both self-augmented positives and negatives sampled from the same music piece; experimental results on two music-related downstream tasks, music classification and cover song identification, demonstrate the generalization ability and transferability of the music representation learned by PEMR.
Self-Supervised Beat Tracking in Musical Signals with Polyphonic Contrastive Learning
- Computer Science
- 2022
This work presents a new self-supervised pretext task for beat tracking and downbeat estimation, and is notably one of the first to use audio source separation as a fundamental component of self-supervision.
MidiBERT-Piano: Large-scale Pre-training for Symbolic Music Understanding
- Computer Science
- ArXiv
- 2021
An attempt to employ the masked language modeling approach of BERT to pre-train a 12-layer Transformer model for tackling a number of symbolic-domain discriminative music understanding tasks, finding that, given a pre-trained Transformer, the models outperform recurrent neural network baselines with fewer than 10 epochs of fine-tuning.
Learning Music Representations with wav2vec 2.0
- Computer Science
- ArXiv
- 2022
The results show that wav2vec 2.0 pre-trained on music data allows us to achieve promising results on music classification tasks that are competitive with prior work on audio representations.
Sound and Visual Representation Learning with Multiple Pretraining Tasks
- Computer Science
- 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
- 2022
The experiments on binaural sound representations demonstrate that Multi-SSL via incremental learning (IL) of SSL tasks outperforms single SSL task models and fully supervised models in the downstream task performance.
Wav2CLIP: Learning Robust Audio Representations from CLIP
- Computer Science
- ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2022
Wav2CLIP is proposed, a robust audio representation learning method by distilling from Contrastive Language-Image Pre-training (CLIP), and is more efficient to pre-train than competing methods as it does not require learning a visual model in concert with an auditory model.
Instance Selection for Music Genre Classification using Heterogeneous Networks
- Computer Science
- Anais do XVIII Simpósio Brasileiro de Computação Musical (SBCM 2021)
- 2021
This work introduces musical data instance selection into heterogeneous network models, proposing and evaluating ten different heterogeneous networks to identify more representative relationships among various musical features, including songs, artists, genres, and mel-spectrograms.
Spectrograms Are Sequences of Patches
- Computer Science
- ArXiv
- 2022
This work treats a spectrogram of music as a sequence of patches and designs a self-supervised model, Patchifier, that captures the features of these sequential patches, making good use of self-supervised learning methods from both the NLP and CV domains.
References
Showing 1-10 of 31 references
Learning Problem-agnostic Speech Representations from Multiple Self-supervised Tasks
- Computer Science
- INTERSPEECH
- 2019
Experiments show that the proposed improved self-supervised method can learn transferable, robust, and problem-agnostic features that carry relevant information from the speech signal, such as speaker identity, phonemes, and even higher-level features such as emotional cues.
Multi-Task Self-Supervised Learning for Robust Speech Recognition
- Computer Science
- ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2020
PASE+ is proposed, an improved version of PASE that better learns short- and long-term speech dynamics with an efficient combination of recurrent and convolutional networks and learns transferable representations suitable for highly mismatched acoustic conditions.
Unsupervised Learning of Local Features for Music Classification
- Computer Science
- ISMIR
- 2012
It is shown that convolutional extraction of local feature responses is crucial for high performance, and that simple, fast learning techniques such as k-means or randomly selected features are competitive with previously published results that also learn features from audio signals.
Look, Listen, and Learn More: Design Choices for Deep Audio Embeddings
- Computer Science
- ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2019
This paper investigates how L3-Net design choices impact the performance of downstream audio classifiers trained with these embeddings, and shows that audio-informed choices of input representation are important, and that using sufficient data for training the embedding is key.
Multitask Learning for Frame-level Instrument Recognition
- Computer Science
- ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2019
A large-scale dataset containing synthetic polyphonic music with frame-level pitch and instrument labels is presented, and a simple yet novel network architecture is proposed to jointly predict the pitch and instrument for each frame; the effectiveness of the proposed method is validated.
Big Self-Supervised Models are Strong Semi-Supervised Learners
- Computer Science
- NeurIPS
- 2020
The proposed semi-supervised learning algorithm can be summarized in three steps: unsupervised pretraining of a big ResNet model using SimCLRv2 (a modification of SimCLR), supervised fine-tuning on a few labeled examples, and distillation with unlabeled examples for refining and transferring the task-specific knowledge.
FMA: A Dataset for Music Analysis
- Computer Science
- ISMIR
- 2017
The Free Music Archive is introduced, an open and easily accessible dataset suitable for evaluating several tasks in MIR, a field concerned with browsing, searching, and organizing large music collections, and some suitable MIR tasks are discussed.
OpenMIC-2018: An Open Data-set for Multiple Instrument Recognition
- Computer Science
- ISMIR
- 2018
The construction of a new, open dataset for multiple-instrument recognition, containing 20,000 examples of Creative Commons-licensed music available on the Free Music Archive, is described, along with how the instrument taxonomy was constructed.
One deep music representation to rule them all? A comparative analysis of different representation learning strategies
- Computer Science
- Neural Computing and Applications
- 2019
An extensive empirical study is conducted that involves multiple learning sources, as well as multiple deep learning architectures with varying levels of information sharing between sources, in order to learn music representations; it yields insights into how to design methods for learning widely deployable deep data representations in the music domain.
Transfer Learning for Music Classification and Regression Tasks
- Computer Science
- ISMIR
- 2017
This paper proposes to use a pre-trained convnet feature, a concatenated feature vector built from the activations of feature maps of multiple layers in a trained convolutional network, and shows how it can serve as a general-purpose music representation.