A sinusoidal signal reconstruction method for the inversion of the mel-spectrogram

@article{natsiou2021sinusoidal,
  title={A sinusoidal signal reconstruction method for the inversion of the mel-spectrogram},
  author={Anastasia Natsiou and Sean O'Leary},
  journal={2021 IEEE International Symposium on Multimedia (ISM)},
  year={2021},
}

  • Published 1 November 2021 · Computer Science
The synthesis of sound via deep learning methods has recently received much attention. Deep learning approaches to sound synthesis face challenges relating to the amount of data needed to specify an audio signal and the need to preserve both the long- and short-time coherence of the synthesised signal. Visual time-frequency representations such as the log-mel-spectrogram have gained in popularity. The log-mel-spectrogram is a perceptually informed representation of audio that greatly…
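Since the log-mel-spectrogram is the representation being inverted, a minimal numpy sketch of how one is typically computed may help. The frame length, hop size, filter count, and HTK-style mel formula below are illustrative choices, not the paper's settings.

```python
import numpy as np

def hz_to_mel(f):
    # HTK-style mel scale
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels=40):
    # Triangular filters with centers evenly spaced on the mel scale
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):
            fb[i - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fb[i - 1, k] = (r - k) / max(r - c, 1)
    return fb

def log_mel_spectrogram(x, sr, n_fft=1024, hop=256, n_mels=40):
    # Windowed STFT power, projected onto the mel filterbank, then log-compressed
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop : i * hop + n_fft] * window for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    mel = power @ mel_filterbank(sr, n_fft, n_mels).T
    return np.log(mel + 1e-10)

# Usage: a 1-second 440 Hz sine at 16 kHz
sr = 16000
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 440.0 * t)
S = log_mel_spectrogram(x, sr)  # shape (frames, n_mels)
```

The energy of the tone concentrates in the mel band whose center frequency is nearest 440 Hz, which is the information a mel-spectrogram inversion method must recover a waveform from.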

Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders

A powerful new WaveNet-style autoencoder model is detailed that conditions an autoregressive decoder on temporal codes learned from the raw audio waveform, and NSynth, a large-scale and high-quality dataset of musical notes that is an order of magnitude larger than comparable public datasets is introduced.

Efficient Neural Audio Synthesis

The WaveRNN, a single-layer recurrent neural network with a dual softmax layer, is introduced and shown to match the quality of the state-of-the-art WaveNet model, along with a new generation scheme based on subscaling that folds a long sequence into a batch of shorter sequences and allows multiple samples to be generated at once.
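The subscaling idea can be illustrated with a small reindexing sketch: subsequence b takes samples b, b+B, b+2B, and so on. This shows only the fold/unfold arithmetic, not WaveRNN's conditioning scheme; the array length and fold factor are illustrative.

```python
import numpy as np

# A toy "waveform" of 12 samples and a fold factor of B = 4
x = np.arange(12)
B = 4

# Fold: reshape to (L/B, B), then transpose so row b holds
# samples b, b+B, b+2B, ... -- B shorter sequences in a batch
folded = x.reshape(-1, B).T    # shape (B, L/B)

# Unfold: invert the reindexing to recover the original sequence
unfolded = folded.T.reshape(-1)
```

Generating the B rows as a batch is what lets multiple samples be produced per step, at the cost of the dependency structure the paper's conditioning scheme is designed to respect.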

WaveNet: A Generative Model for Raw Audio

WaveNet, a deep neural network for generating raw audio waveforms, is introduced; it is shown that it can be efficiently trained on data with tens of thousands of samples per second of audio, and can be employed as a discriminative model, returning promising results for phoneme recognition.

Speech analysis/Synthesis based on a sinusoidal representation

A sinusoidal model for the speech waveform, in which the signal is characterized by the amplitudes, frequencies, and phases of its component sine waves, is used to develop a new analysis/synthesis technique; this forms the basis for new approaches to speech transformations, including time-scale and pitch-scale modification, and to mid-rate speech coding.
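The sinusoidal representation described above lends itself to a direct synthesis sketch: sum the component sine waves given their parameters. The partial amplitudes, frequencies, and phases below are made-up values for a toy harmonic tone; real sinusoidal modelling estimates them frame-by-frame from analysis.

```python
import numpy as np

def sinusoidal_synthesis(amps, freqs, phases, sr, duration):
    """Reconstruct a signal as a sum of sinusoidal partials:
    x(t) = sum_k a_k * sin(2*pi*f_k*t + phi_k)."""
    t = np.arange(int(sr * duration)) / sr
    x = np.zeros_like(t)
    for a, f, p in zip(amps, freqs, phases):
        x += a * np.sin(2 * np.pi * f * t + p)
    return x

# Usage: a 220 Hz tone with two decaying harmonics, 0.5 s at 16 kHz
sr = 16000
x = sinusoidal_synthesis([1.0, 0.5, 0.25], [220.0, 440.0, 660.0],
                         [0.0, 0.0, 0.0], sr, 0.5)
```

Because the parameters are explicit, transformations like pitch-scale modification reduce to scaling the frequency list before resynthesis.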

DDSP: Differentiable Digital Signal Processing

The Differentiable Digital Signal Processing library is introduced, which enables direct integration of classic signal processing elements with deep learning methods and achieves high-fidelity generation without the need for large autoregressive models or adversarial losses.

MelNet: A Generative Model for Audio in the Frequency Domain

This work designs a model capable of generating high-fidelity audio samples which capture structure at timescales that time-domain models have yet to achieve, and applies it to a variety of audio generation tasks, showing improvements over previous approaches in both density estimates and human judgments.

Natural TTS Synthesis by Conditioning Wavenet on MEL Spectrogram Predictions

This paper describes Tacotron 2, a neural network architecture for speech synthesis directly from text. The system is composed of a recurrent sequence-to-sequence feature prediction network that maps character embeddings to mel-scale spectrograms, followed by a modified WaveNet model acting as a vocoder to synthesize time-domain waveforms from those spectrograms.

Analysis, Synthesis, and Perception of Musical Sounds: The Sound of Music (Modern Acoustics and Signal Processing)

Analysis and Synthesis of Musical Instrument Sounds: Analysis/Synthesis methods and Directions for Future Research.

MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis

The model is non-autoregressive and fully convolutional, has significantly fewer parameters than competing models, and generalizes to unseen speakers for mel-spectrogram inversion; the paper also suggests a set of guidelines for designing general-purpose discriminators and generators for conditional sequence synthesis tasks.
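MelGAN, like the present paper, targets mel-spectrogram inversion. As a baseline intuition for why inversion is non-trivial, here is a small numpy sketch showing that the mel compression step alone is only invertible in a least-squares sense; the filterbank below is a toy one with uniformly spaced triangles, not a real mel filterbank.

```python
import numpy as np

n_freq, n_mels = 513, 80
k = np.arange(n_freq)

# Toy triangular filterbank: n_mels overlapping triangles with
# uniformly spaced centers (a real mel filterbank spaces them on the mel scale)
centers = np.linspace(0, n_freq - 1, n_mels + 2)
M = np.zeros((n_mels, n_freq))
for i in range(n_mels):
    l, c, r = centers[i], centers[i + 1], centers[i + 2]
    M[i] = np.clip(np.minimum((k - l) / (c - l), (r - k) / (r - c)), 0.0, None)

# Compress a nonnegative "power spectrum" to n_mels bands
rng = np.random.default_rng(0)
spec = rng.random(n_freq)
mel = M @ spec

# Invert by least squares via the pseudo-inverse, then clip:
# power spectra are nonnegative
spec_ls = np.linalg.pinv(M) @ mel
spec_hat = np.clip(spec_ls, 0.0, None)
```

The least-squares estimate reproduces the mel bands exactly (M has full row rank), but the original 513-bin spectrum cannot be recovered from 80 bands; learned inverters such as MelGAN, or sinusoidal reconstruction as in this paper, supply the missing detail.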