Generative adversarial network-based approach to signal reconstruction from magnitude spectrogram

  • Keisuke Oyamada, H. Kameoka, Takuhiro Kaneko, Kou Tanaka, Nobukatsu Hojo, Hiroyasu Ando
  • 2018 26th European Signal Processing Conference (EUSIPCO)
  • 2018
In this paper, we address the problem of reconstructing a time-domain signal (or a phase spectrogram) solely from a magnitude spectrogram. [...] This method usually requires many iterations for the signal reconstruction process and, depending on the inputs, it does not always produce high-quality audio signals.
Phase Reconstruction Based On Recurrent Phase Unwrapping With Deep Neural Networks
A DNN-based two-stage phase reconstruction method in which phase is recursively estimated from its estimated derivatives, named recurrent phase unwrapping (RPU); experimental results confirm that the proposed method outperforms direct phase estimation by a DNN.
Deep Griffin–Lim Iteration
A novel phase reconstruction method that combines a signal-processing-based approach and a deep neural network (DNN), stacking two GLA-inspired fixed layers with a DNN.
Phase Reconstruction with Learned Time-Frequency Representations for Single-Channel Speech Separation
This paper explicitly integrates phase reconstruction into the separation algorithm using a loss function defined on time-domain signals, and allows the network to learn a modified version of the time-frequency representation from data, instead of using fixed STFT/iSTFT representations.
Expediting TTS Synthesis with Adversarial Vocoding
This work proposes an alternative approach that utilizes generative adversarial networks (GANs) to learn mappings from perceptually informed spectrograms to simple magnitude spectrogram representations that can be heuristically vocoded.
StarGAN-VC: non-parallel many-to-many Voice Conversion Using Star Generative Adversarial Networks
Subjective evaluation experiments on a non-parallel many-to-many speaker identity conversion task revealed that the proposed method obtained higher sound quality and speaker similarity than a state-of-the-art method based on variational autoencoding GANs. Expand
Rectified Linear Unit Can Assist Griffin-Lim Phase Recovery
Phase recovery is an essential process for reconstructing a time-domain signal from the corresponding spectrogram when its phase is contaminated or unavailable. [...]
Fast Griffin Lim based waveform generation strategy for text-to-speech synthesis
An approach to reduce speech-synthesis delay is proposed, and the quality of the output speech improves, as evidenced by higher mean opinion scores (MOS) and faster convergence with the fast Griffin-Lim algorithm (FGLA) compared with GLA.
Audio Coding Based on Spectral Recovery by Convolutional Neural Network
The proposed method can enhance the coding performance compared with conventional transform coding, and provides higher sound quality than USAC, by an average MUSHRA score margin of 8.5.
Joint Amplitude and Phase Refinement for Monaural Source Separation
The alternating direction method of multipliers (ADMM) is utilized to find time-domain signals whose amplitude spectrograms are close to the given ones in terms of the generalized alpha-beta divergences; the effectiveness of the proposed method is confirmed through speech/non-speech separation.
Trainable Adaptive Window Switching for Speech Enhancement
  • Yuma Koizumi, N. Harada, Y. Haneda
  • Computer Science, Engineering
  • ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2019
This study proposes a trainable adaptive window switching (AWS) method and applies it to a deep neural network (DNN) for speech enhancement in the modified discrete cosine transform (MDCT) domain; the proposed method achieved a higher signal-to-distortion ratio than conventional speech enhancement methods operating in fixed-resolution frequency domains.


The modification of magnitude spectrograms is at the core of many audio signal processing methods, from source separation to sound modification or noise canceling, and reconstructing a natural [...]
Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network
  • C. Ledig, Lucas Theis, +6 authors W. Shi
  • Computer Science, Mathematics
  • 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
  • 2017
SRGAN, a generative adversarial network (GAN) for image super-resolution (SR), is presented; to the authors' knowledge, it is the first framework capable of inferring photo-realistic natural images at 4x upscaling factors, together with a perceptual loss function that combines an adversarial loss and a content loss.
Generative Adversarial Nets
We propose a new framework for estimating generative models via an adversarial process, in which we simultaneously train two models: a generative model G that captures the data distribution, and a discriminative model D that estimates the probability that a sample came from the training data rather than G.
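As a sketch of the two-player objective described above, this NumPy snippet (the function name and labels are illustrative, not from the paper) evaluates the minimax value V(D, G) = E_x[log D(x)] + E_z[log(1 - D(G(z)))] from discriminator scores:

```python
import numpy as np

def gan_value(d_real, d_fake):
    """Minimax value V(D, G): `d_real` are discriminator scores on real
    samples, `d_fake` on generated samples, each in the open interval (0, 1)."""
    d_real = np.asarray(d_real, dtype=float)
    d_fake = np.asarray(d_fake, dtype=float)
    return np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake))

# At the theoretical equilibrium, D outputs 0.5 everywhere and
# V(D, G) = -2 * log(2) ~= -1.386.
```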
Least Squares Generative Adversarial Networks
This paper proposes the Least Squares Generative Adversarial Networks (LSGANs), which adopt the least squares loss function for the discriminator, and shows that minimizing the LSGAN objective function amounts to minimizing the Pearson χ² divergence.
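The least squares losses referred to above replace the cross-entropy terms of the original GAN with squared distances to target labels; a minimal NumPy sketch (function names and default labels a=0, b=c=1 are illustrative conventions, not quoted from the paper):

```python
import numpy as np

def lsgan_d_loss(d_real, d_fake, a=0.0, b=1.0):
    """LSGAN discriminator loss: pull D(real) toward label b and
    D(G(z)) toward label a, using least squares instead of cross-entropy."""
    d_real = np.asarray(d_real, dtype=float)
    d_fake = np.asarray(d_fake, dtype=float)
    return 0.5 * np.mean((d_real - b) ** 2) + 0.5 * np.mean((d_fake - a) ** 2)

def lsgan_g_loss(d_fake, c=1.0):
    """LSGAN generator loss: pull D(G(z)) toward label c."""
    return 0.5 * np.mean((np.asarray(d_fake, dtype=float) - c) ** 2)
```

Because the penalty grows quadratically with distance from the label, fake samples that the discriminator already classifies confidently still receive gradient, which is the property the paper exploits to stabilize training.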
Direct Modeling of Frequency Spectra and Waveform Generation Based on Phase Recovery for DNN-Based Speech Synthesis
STFT spectral amplitudes that include harmonic information derived from F0 are directly predicted by a DNN-based acoustic model, and Griffin and Lim's approach to recovering phase and generating waveforms is investigated.
Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification
This work proposes a Parametric Rectified Linear Unit (PReLU) that generalizes the traditional rectified linear unit, and derives a robust initialization method that particularly considers the rectifier nonlinearities.
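The PReLU described above is a one-line generalization of the ReLU: identity for non-negative inputs, a slope parameter (learned per channel in the paper) for negative inputs. A minimal NumPy sketch with a fixed slope for illustration:

```python
import numpy as np

def prelu(x, a=0.25):
    """Parametric ReLU: f(x) = x for x >= 0, f(x) = a*x for x < 0.
    Setting a = 0 recovers the plain ReLU."""
    x = np.asarray(x, dtype=float)
    return np.where(x >= 0, x, a * x)
```

In the paper the slope `a` is a trainable parameter updated by backpropagation; here it is a constant only to keep the sketch self-contained.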
Tacotron: A Fully End-to-End Text-To-Speech Synthesis Model
This paper presents Tacotron, an end-to-end generative text-to-speech model that synthesizes speech directly from characters, and presents several key techniques to make the sequence-to-sequence framework perform well for this challenging task.
Signal estimation from modified short-time Fourier transform
An algorithm to estimate a signal from its modified short-time Fourier transform (STFT) is presented, minimizing the mean squared error between the STFT of the estimated signal and the modified STFT magnitude.
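The abstract above describes the classic Griffin-Lim algorithm (GLA): alternate between synthesizing a signal from the current complex spectrogram estimate and re-analyzing it, keeping the target magnitude and only the newly estimated phase. A minimal NumPy/SciPy sketch, with function names and STFT parameters chosen for illustration rather than taken from the paper:

```python
import numpy as np
from scipy.signal import stft, istft

def griffin_lim(mag, n_iter=30, nperseg=256, noverlap=192, seed=0):
    """Estimate a time-domain signal whose STFT magnitude matches `mag`,
    starting from a random phase and iterating iSTFT/STFT projections."""
    rng = np.random.default_rng(seed)
    phase = np.exp(2j * np.pi * rng.random(mag.shape))  # random initial phase
    for _ in range(n_iter):
        # Synthesize a signal from the current complex estimate ...
        _, x = istft(mag * phase, nperseg=nperseg, noverlap=noverlap)
        # ... then re-analyze it and keep only the phase of its STFT.
        _, _, spec = stft(x, nperseg=nperseg, noverlap=noverlap)
        phase = np.exp(1j * np.angle(spec))
    _, x = istft(mag * phase, nperseg=nperseg, noverlap=noverlap)
    return x
```

Each iteration requires a full iSTFT/STFT pair, which is exactly the per-iteration cost that motivates the faster and learned variants (FGLA, deep Griffin-Lim) surveyed in the entries above.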
Deep Unfolding: Model-Based Inspiration of Novel Deep Architectures
This work starts with a model-based approach and an associated inference algorithm, and unfolds the inference iterations as layers in a deep network; it shows how this framework allows one to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm.
Compositional Models for Audio Processing: Uncovering the structure of sound mixtures
Many classes of data are composed as constructive combinations of parts that do not result in subtraction or diminishment of any of the parts; models of such data are referred to as compositional models.