Deep Performer: Score-to-Audio Music Performance Synthesis

@inproceedings{Dong2022DeepPS,
  title={Deep Performer: Score-to-Audio Music Performance Synthesis},
  author={Hao-Wen Dong and Cong Zhou and Taylor Berg-Kirkpatrick and Julian McAuley},
  booktitle={ICASSP},
  year={2022}
}
Music performance synthesis aims to synthesize a musical score into a natural performance. In this paper, we borrow recent advances in text-to-speech synthesis and present the Deep Performer—a novel system for score-to-audio music performance synthesis. Unlike speech, music often contains polyphony and long notes. Hence, we propose two new techniques for handling polyphonic inputs and providing fine-grained conditioning in a transformer encoder-decoder model. To train our proposed system, we…

References

Showing 1-10 of 34 references
PerformanceNet: Score-to-Audio Music Generation with Multi-Band Convolutional Residual Network
A deep convolutional model is proposed that learns in an end-to-end manner the score-to-audio mapping between a symbolic representation of music called the pianoroll and an audio representation of music called the spectrogram, and achieves a higher mean opinion score (MOS) in naturalness and emotional expressivity than a WaveNet-based model and two off-the-shelf synthesizers.
SynthNet: Learning to Synthesize Music End-to-End
It is concluded that mappings between musical notes and the instrument timbre can be learned directly from the raw audio coupled with the musical score, in binary piano roll format.
Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders
A powerful new WaveNet-style autoencoder model is detailed that conditions an autoregressive decoder on temporal codes learned from the raw audio waveform, and NSynth, a large-scale, high-quality dataset of musical notes that is an order of magnitude larger than comparable public datasets, is introduced.
SING: Symbol-to-Instrument Neural Generator
This work presents a lightweight neural audio synthesizer trained end-to-end to generate notes from nearly 1000 instruments with a single decoder, thanks to a new loss function that minimizes the distances between the log spectrograms of the generated and target waveforms.
Neural Music Synthesis for Flexible Timbre Control
A neural music synthesis model with flexible timbre controls, which consists of a recurrent neural network conditioned on a learned instrument embedding followed by a WaveNet vocoder, is described.
Conditioning Deep Generative Raw Audio Models for Structured Automatic Music
This paper considers a Long Short Term Memory network to learn the melodic structure of different styles of music, and then uses the unique symbolic generations from this model as a conditioning input to a WaveNet-based raw audio generator, creating a model for automatic, novel music.
Enabling Factorized Piano Music Modeling and Generation with the MAESTRO Dataset
By using notes as an intermediate representation, a suite of models capable of transcribing, composing, and synthesizing audio waveforms with coherent musical structure on timescales spanning six orders of magnitude is trained, a process the authors call Wave2Midi2Wave.
MIDI-DDSP: Detailed Control of Musical Performance via Hierarchical Modeling
This work introduces MIDI-DDSP, a hierarchical model of musical instruments that enables both realistic neural audio synthesis and detailed user control, and opens the door to assistive tools to empower individuals across a diverse range of musical experience.
MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis
The model is non-autoregressive, fully convolutional, has significantly fewer parameters than competing models, and generalizes to unseen speakers for mel-spectrogram inversion; the work also suggests a set of guidelines for designing general-purpose discriminators and generators for conditional sequence synthesis tasks.
Tacotron: Towards End-to-End Speech Synthesis
Tacotron is presented, an end-to-end generative text-to-speech model that synthesizes speech directly from characters and achieves a 3.82 subjective 5-scale mean opinion score on US English, outperforming a production parametric system in terms of naturalness.