Jointist: Joint Learning for Multi-instrument Transcription and Its Applications

Authors: Kin Wai Cheuk, Keunwoo Choi, Qiuqiang Kong, Bochen Li, Minz Won, Amy Hung, Ju-Chiang Wang, Dorien Herremans
In this paper, we introduce Jointist, an instrument-aware multi-instrument framework capable of transcribing, recognizing, and separating multiple musical instruments from an audio clip. Jointist consists of an instrument recognition module that conditions the other two modules: a transcription module that outputs instrument-specific piano rolls, and a source separation module that uses the instrument information and the transcription results. The instrument conditioning is designed for an…


MT3: Multi-Task Multitrack Music Transcription

This work demonstrates that a general-purpose Transformer model can perform multi-task AMT, jointly transcribing arbitrary combinations of musical instruments across several transcription datasets, dramatically improving performance for low-resource instruments while preserving strong performance for abundant instruments.

A Unified Model for Zero-shot Music Source Separation, Transcription and Synthesis

Experimental results show that the model outperforms existing multi-task baselines, and the transcribed score serves as a powerful auxiliary for separation tasks.

Simultaneous Separation and Transcription of Mixtures with Multiple Polyphonic and Percussive Instruments

We present a single deep learning architecture that can both separate an audio recording of a musical mixture into constituent single-instrument recordings and transcribe these instruments into a

Multi-Instrument Automatic Music Transcription With Self-Attention-Based Instance Segmentation

This article proposes a multi-instrument AMT method, with signal processing techniques specifying pitch saliency, novel deep learning techniques, and concepts partly inspired by multi-object recognition, instance segmentation, and image-to-image translation in computer vision.

Automatic music transcription: challenges and future directions

Limits of current transcription methods are analyzed and promising directions for future research are identified, including the integration of information from multiple algorithms and different musical aspects.

Transcription Is All You Need: Learning To Separate Musical Mixtures With Score As Supervision

This work uses musical scores, which are comparatively easy to obtain, as a weak label for training a source separation system, and proposes two novel adversarial losses for additional fine-tuning of both the transcriptor and the separator.

Multi-Instrument Music Transcription Based on Deep Spherical Clustering of Spectrograms and Pitchgrams

The proposed clustering-based music transcription method can transcribe musical pieces that include unknown musical instruments as well as pieces containing only predefined instruments, with state-of-the-art transcription accuracy.

Enabling Factorized Piano Music Modeling and Generation with the MAESTRO Dataset

By using notes as an intermediate representation, the authors train a suite of models capable of transcribing, composing, and synthesizing audio waveforms with coherent musical structure on timescales spanning six orders of magnitude, a process they call Wave2Midi2Wave.

MIDI-VAE: Modeling Dynamics and Instrumentation of Music with Applications to Style Transfer

We introduce MIDI-VAE, a neural network model based on Variational Autoencoders that is capable of handling polyphonic music with multiple instrument tracks, as well as modeling the dynamics of music

MidiBERT-Piano: Large-scale Pre-training for Symbolic Music Understanding

This work employs the mask language modeling approach of BERT to pre-train a 12-layer Transformer model for a number of symbolic-domain discriminative music understanding tasks, and finds that, given a pretrained Transformer, the models outperform recurrent-neural-network-based baselines with fewer than 10 epochs of fine-tuning.