Corpus ID: 43099890


Authors: Carl Thomé and Sven Ahlbäck
Recent directions in automatic speech recognition (ASR) research have shown that applying deep learning models from image recognition challenges in computer vision is beneficial. As automatic music transcription (AMT) is superficially similar to ASR, in the sense that methods often rely on transforming spectrograms to symbolic sequences of events (e.g. words or notes), deep learning should benefit AMT as well. In this work, we outline an online polyphonic pitch detection system that streams… 
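The spectrogram-to-symbolic-sequence framing described above can be illustrated with a minimal decoding step. The sketch below is a hypothetical illustration, not the paper's actual system: given framewise pitch posteriors from any model, thresholding and grouping consecutive active frames yields symbolic note events.

```python
import numpy as np

def decode_frames(posteriors, threshold=0.5):
    """Convert framewise pitch posteriors (frames x pitches) into
    (pitch, onset_frame, offset_frame) note events by thresholding
    and grouping consecutive active frames."""
    active = posteriors >= threshold
    events = []
    n_frames, n_pitches = active.shape
    for p in range(n_pitches):
        onset = None
        for t in range(n_frames):
            if active[t, p] and onset is None:
                onset = t                      # note starts
            elif not active[t, p] and onset is not None:
                events.append((p, onset, t))   # note ends
                onset = None
        if onset is not None:                  # note still sounding at the end
            events.append((p, onset, n_frames))
    return events

# Toy posteriors for 5 frames and 2 pitches: pitch 1 is active on frames 1-3.
post = np.array([[0.1, 0.2],
                 [0.1, 0.9],
                 [0.2, 0.8],
                 [0.1, 0.7],
                 [0.1, 0.3]])
events = decode_frames(post)  # → [(1, 1, 4)]
```

In a streaming (online) setting, the same grouping logic can be applied incrementally per incoming frame rather than over a full matrix.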


Onsets and Frames: Dual-Objective Piano Transcription

This work uses a deep convolutional and recurrent neural network to predict pitch onset events and then uses those predictions to condition framewise pitch predictions, which results in a relative improvement of over 100% in note F1 score on the MAPS dataset.
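The dual-objective idea, conditioning framewise predictions on onset predictions, can be sketched as a simple gating rule. This is an illustrative simplification of the decoding logic, not the paper's network: a framewise activation is kept only if its note segment begins with a detected onset.

```python
import numpy as np

def condition_on_onsets(frame_active, onset_active):
    """Keep a framewise activation (frames x pitches, bool) only if the
    contiguous note segment it belongs to starts with a detected onset.
    A simplified, Onsets-and-Frames-style gating rule."""
    out = np.zeros_like(frame_active)
    n_frames, n_pitches = frame_active.shape
    for p in range(n_pitches):
        sounding = False
        for t in range(n_frames):
            if frame_active[t, p]:
                if onset_active[t, p]:
                    sounding = True        # segment confirmed by an onset
                if sounding:
                    out[t, p] = True
            else:
                sounding = False           # segment ended; require a new onset
    return out

# Toy single-pitch example: frames 0-3 look active, but the onset
# detector only fires at frame 1, so frame 0 is rejected as spurious.
frame = np.array([[1], [1], [1], [1], [0]], dtype=bool)
onset = np.array([[0], [1], [0], [0], [0]], dtype=bool)
gated = condition_on_onsets(frame, onset)  # column → [F, T, T, T, F]
```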

Deep Polyphonic ADSR Piano Note Transcription

A late-fusion approach to piano transcription, combined with a strong temporal prior in the form of a handcrafted Hidden Markov Model (HMM), is able to outperform other approaches by a large margin when predicting complete note regions from onsets to offsets.
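A minimal stand-in for such an HMM temporal prior is two-state (note off / note on) Viterbi smoothing of noisy framewise posteriors. The sketch below is illustrative only, not the paper's handcrafted model: switching states costs probability mass, so brief flickers in the observations are smoothed away.

```python
import numpy as np

def viterbi_smooth(loglik_on, loglik_off, stay=0.9):
    """Viterbi decoding with a 2-state HMM prior (0 = note off, 1 = note on)
    that penalises rapid state switches. Inputs are per-frame emission
    log-likelihoods; returns the most likely state path."""
    n = len(loglik_on)
    log_stay, log_switch = np.log(stay), np.log(1 - stay)
    delta = np.zeros((n, 2))           # best log-score ending in each state
    back = np.zeros((n, 2), dtype=int)  # backpointers
    delta[0] = [loglik_off[0], loglik_on[0]]
    for t in range(1, n):
        for s, emit in ((0, loglik_off[t]), (1, loglik_on[t])):
            scores = [delta[t - 1, 0] + (log_stay if s == 0 else log_switch),
                      delta[t - 1, 1] + (log_switch if s == 0 else log_stay)]
            back[t, s] = int(np.argmax(scores))
            delta[t, s] = max(scores) + emit
    path = [int(np.argmax(delta[-1]))]
    for t in range(n - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# A note that dips below 0.5 for one frame: raw thresholding would split
# the note, but the HMM prior smooths over the flicker.
on = np.log([0.9, 0.9, 0.4, 0.9, 0.9])
off = np.log([0.1, 0.1, 0.6, 0.1, 0.1])
path = viterbi_smooth(on, off)  # → [1, 1, 1, 1, 1]
```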

End-to-End Music Transcription Using Fine-Tuned Variable-Q Filterbanks

This work replaces the time-frequency calculation step of a baseline transcription architecture with a learned equivalent, initialized with the frequency response of a Variable-Q Transform, and the resulting filterbanks are visualized and evaluated against the standard transform.

Pitch-Informed Instrument Assignment Using a Deep Convolutional Network with Multiple Kernel Shapes

A deep convolutional neural network is proposed for note-level instrument assignment, given a polyphonic multi-instrumental music signal along with its ground-truth or predicted notes; the effects of multiple kernel shapes are studied and different input representations for the audio and the note-related information are compared.

The melodic beat: exploring asymmetry in polska performance

Some triple-beat forms in Scandinavian Folk Music are characterized by non-isochronous beat durations: asymmetric beats. Theorists of folk music have suggested that the variability of rhythmic…

Improving Polyphonic Piano Transcription using Deep Residual Learning

In this thesis, a new deep learning method is adapted for frame-wise polyphonic piano note transcription. It is based on the idea of residual learning, which is then extended with Bidirectional Long Short-Term Memory (BLSTM) networks.



Polyphonic piano note transcription with recurrent neural networks

  • Sebastian Böck, M. Schedl
  • Computer Science
    2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2012
A new approach for polyphonic piano note onset transcription, based on a recurrent neural network that simultaneously detects the onsets and the pitches of notes from spectral features, and which generalizes much better than existing systems.

Deep Salience Representations for F0 Estimation in Polyphonic Music

A fully convolutional neural network for learning salience representations for fundamental frequency estimation, trained on a large, semi-automatically generated f0 dataset, is described and shown to achieve state-of-the-art performance on several multi-f0 and melody datasets.

Very deep convolutional networks for end-to-end speech recognition

This work successively trains very deep convolutional networks to add more expressive power and better generalization to end-to-end ASR models, applying network-in-network principles, batch normalization, residual connections, and convolutional LSTMs to build very deep recurrent and convolutional structures.

An End-to-End Neural Network for Polyphonic Piano Music Transcription

An efficient variant of beam search is presented that improves performance and reduces run-times by an order of magnitude, making the model suitable for real-time applications.

Convolutional recurrent neural networks for music classification

It is found that CRNNs show strong performance with respect to the number of parameters and training time, indicating the effectiveness of their hybrid structure for music feature extraction and feature summarisation.

Deep Speech 2 : End-to-End Speech Recognition in English and Mandarin

It is shown that an end-to-end deep learning approach can be used to recognize either English or Mandarin Chinese speech, two vastly different languages, and is competitive with the transcription of human workers when benchmarked on standard datasets.

A Shift-Invariant Latent Variable Model for Automatic Music Transcription

Results demonstrate that the proposed probabilistic model for multiple-instrument automatic music transcription outperforms leading approaches from the transcription literature, using several error metrics.

LSTM: A Search Space Odyssey

This paper presents the first large-scale analysis of eight LSTM variants on three representative tasks: speech recognition, handwriting recognition, and polyphonic music modeling, and observes that the studied hyperparameters are virtually independent and derive guidelines for their efficient adjustment.

Sequence to Sequence Learning with Neural Networks

This paper presents a general end-to-end approach to sequence learning that makes minimal assumptions about sequence structure, and finds that reversing the order of the words in all source sentences improved the LSTM's performance markedly, because doing so introduced many short-term dependencies between the source and the target sentence that made the optimization problem easier.

On the Potential of Simple Framewise Approaches to Piano Transcription

It is shown that it is possible, by simple bottom-up frame-wise processing, to obtain a piano transcriber that outperforms the current published state of the art on the publicly available MAPS dataset -- without any complex post-processing steps.