A Shift-Invariant Latent Variable Model for Automatic Music Transcription


In this work, a probabilistic model for multiple-instrument automatic music transcription is proposed. The model extends the shift-invariant probabilistic latent component analysis method, which is used for spectrogram factorization. The proposed extensions support the use of multiple spectral templates per pitch and per instrument source, as well as a time-varying pitch contribution for each source. Thus, this method can effectively be used for multiple-instrument automatic transcription. In addition, the shift-invariant aspect of the method can be exploited for detecting tuning changes and frequency modulations, as well as for visualizing pitch content. For note tracking and smoothing, pitch-wise hidden Markov models are used. For training, pitch templates from eight orchestral instruments were extracted, covering their complete note range. The transcription system was tested on multiple-instrument polyphonic recordings from the RWC database, a Disklavier data set, and the MIREX 2007 multi-F0 data set. Results demonstrate that the proposed method outperforms leading approaches from the transcription literature across several error metrics.

Computer Music Journal, 36:4, pp. 81–94, Winter 2012. © 2013 Massachusetts Institute of Technology.

Automatic music transcription refers to the process of converting musical audio, usually a recording, into some form of notation, e.g., sheet music, a MIDI file, or a “piano-roll” representation. It has applications in music information retrieval, computational musicology, and the creation of interactive music systems (e.g., real-time accompaniment, automatic instrument tutoring). The transcription problem can be separated into several subtasks, including multi-pitch estimation (which is considered to be the core problem of transcription), onset/offset detection, instrument identification, and rhythmic parsing.

Although transcribing a monophonic recording is considered a solved problem in the literature, the creation of a transcription system able to handle polyphonic music produced by multiple instruments remains an open problem. For reviews on multi-pitch detection and automatic transcription approaches, the reader is referred to de Cheveigné (2006) and Klapuri and Davy (2006).

Approaches to transcription have used probabilistic methods (e.g., Kameoka, Nishimoto, and Sagayama 2007; Emiya, Badeau, and David 2010), audio feature-based techniques (e.g., Ryynänen and Klapuri 2008; Saito et al. 2008; Cañadas-Quesada et al. 2010), or machine learning approaches (e.g., Poliner and Ellis 2007). More recently, transcription systems using spectrogram-factorization techniques have been proposed (e.g., Mysore and Smaragdis 2009; Dessein, Cont, and Lemaitre 2010; Grindlay and Ellis 2010; Fuentes, Badeau, and Richard 2011). The aim of these techniques is to decompose the input spectrogram into matrices denoting spectral templates and pitch activations. Transcription systems or pitch-tracking methods that use spectrogram-factorization models similar to the ones used in this article are detailed in the following section.

Transcription approaches that use the same data sets used in this work include Poliner and Ellis (2007), where a piano-only transcription algorithm is proposed using support vector machines for note classification. For note smoothing, those authors fed the output of the classifier as input to a hidden Markov model (HMM) (Rabiner 1989). They performed experiments on a set of ten Disklavier recordings, which are also used in this article. The same postprocessing method was also used in the work of Cañadas-Quesada et al. (2010), where the joint multi-pitch estimation algorithm consists of a weighted Gaussian spectral distance measure.

Saito et al. (2008) proposed an audio feature-based multiple-F0 estimation method that uses the inverse Fourier transform of the linear power spectrum with log-scale frequency, which is called specmurt. The input log-frequency spectrum is considered to be generated by a convolution of a single pitch template with a pitch indicator function. The deconvolution of the spectrum by the pitch template results in the estimated pitch indicator function. This method is roughly equivalent to the single-component shift-invariant probabilistic latent component analysis method (Smaragdis, Raj, and Shashanka 2008), which will be detailed in the following section. Finally, we proposed an audio feature-based method for transcription (Benetos and Dixon 2011a), where joint multi-pitch estimation is performed using a weighted score function primarily based on features extracted from the harmonic envelopes of pitch candidates. Postprocessing is applied using conditional random fields.

In this article, we propose a system for polyphonic music transcription based on a convolutive probabilistic model, which extends the shift-invariant probabilistic latent component analysis model (Smaragdis, Raj, and Shashanka 2008). The original model was proposed for relative pitch-tracking (estimating pitch changes on a relative scale) using a single pitch template per source. Here, the model is proposed for multi-pitch detection, supporting the use of multiple templates per pitch and instrument source. In addition, the source contribution is time-varying, making the model more robust for transcription, and sparsity is enforced in order to further constrain the solution. Note smoothing is performed using HMMs trained on MIDI data from the Real World Computing (RWC) database (Goto et al. 2003). The output of the system is a pitch activity matrix in MIDI units and a time-pitch representation; the latter can be used for visualizing pitch content.
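The convolutive view described above for specmurt, in which a log-frequency spectrum is modeled as a single pitch template convolved with a pitch indicator function, can be illustrated with a small noise-free sketch. This is not the implementation of Saito et al. (2008) or of the proposed model: the semitone resolution (`BINS_PER_OCTAVE`), the harmonic weights, the function names, and the three-note test signal are all assumptions chosen for illustration, and deconvolution is done by plain division in the Fourier domain.

```python
import numpy as np

BINS_PER_OCTAVE = 12  # assumption: one log-frequency bin per semitone


def harmonic_template(weights=(1.0, 0.4, 0.25, 0.15)):
    """Build a pitch template on a log-frequency axis.

    Harmonic k sits log2(k) octaves above the fundamental, so its bin
    offset does not depend on the pitch. The weights are hypothetical.
    """
    offsets = [int(round(BINS_PER_OCTAVE * np.log2(k)))
               for k in range(1, len(weights) + 1)]
    template = np.zeros(max(offsets) + 1)
    for offset, weight in zip(offsets, weights):
        template[offset] += weight
    return template


def synthesize(indicator, template):
    """Generate a log-frequency spectrum as indicator convolved with template."""
    return np.convolve(indicator, template)


def deconvolve(spectrum, template):
    """Estimate the pitch indicator by dividing in the Fourier domain."""
    n = len(spectrum)
    spectrum_ft = np.fft.rfft(spectrum, n)
    template_ft = np.fft.rfft(template, n)
    estimate = np.fft.irfft(spectrum_ft / template_ft, n)
    # Keep only the bins that belong to the original indicator axis.
    return estimate[: n - len(template) + 1]


# Toy example: three simultaneous pitches with different strengths.
template = harmonic_template()
indicator = np.zeros(48)
indicator[[12, 16, 19]] = [1.0, 0.7, 0.5]

spectrum = synthesize(indicator, template)
estimate = deconvolve(spectrum, template)
print(np.flatnonzero(estimate > 0.1))  # bins with recovered pitch activity
```

The division is well behaved here only because the assumed template has a dominant fundamental, which keeps its Fourier transform away from zero, and because the toy spectrum contains no noise. With real spectra the division is ill-conditioned and some form of regularization or an iterative estimate would be needed; a single fixed template also cannot capture differences between pitches and instruments, which is the limitation that multiple templates per pitch and source are meant to address.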
We presented preliminary results using the proposed model in Benetos and Dixon (2011c), where the use of a residual template was not supported and the HMM postprocessing step did not include a smoothing parameter. This article contains experiments using additional recordings from the RWC database beyond the set we used in Benetos and Dixon (2011c). Here, we present results using 17 excerpts from the RWC database (classical and jazz recordings) (Goto et al. 2003), 10 recordings from a Disklavier piano (Poliner and Ellis 2007), and the MIREX 2007 multi-F0 woodwind recording (MIREX 2007). We have performed evaluations using several error metrics from the transcription literature, and results show that the proposed model outperforms other transcription methods from the literature. This model, using a time-frequency representation with lower frequency resolution, was publicly evaluated in MIREX 2011, where the submitted system ranked second in the note-tracking task (Benetos and Dixon 2011b). Finally, the proposed model can be further expanded for musical instrument identification in polyphonic music and can also be useful in instrument-specific transcription applications.

The remainder of the article presents the shift-invariant probabilistic latent component analysis method, the proposed model, and evaluation results compared with other state-of-the-art transcription methods.

DOI: 10.1162/COMJ_a_00146

Cite this paper

@article{Benetos2012ASL, title={A Shift-Invariant Latent Variable Model for Automatic Music Transcription}, author={Emmanouil Benetos and Simon Dixon}, journal={Computer Music Journal}, year={2012}, volume={36}, pages={81-94} }