• Corpus ID: 203952250

MelGAN-VC: Voice Conversion and Audio Style Transfer on arbitrarily long samples using Spectrograms

  @article{pasini2019melganvc,
    title={MelGAN-VC: Voice Conversion and Audio Style Transfer on arbitrarily long samples using Spectrograms},
    author={Marco Pasini},
    year={2019}
  }
Traditional voice conversion methods rely on parallel recordings of multiple speakers pronouncing the same sentences. For real-world applications, however, parallel data is rarely available. We propose MelGAN-VC, a voice conversion method that relies on non-parallel speech data and can convert audio signals of arbitrary length from a source voice to a target voice. We first compute spectrograms from waveform data and then perform a domain translation using a Generative Adversarial… 
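The pipeline described in the abstract (waveform → spectrogram → GAN domain translation) starts with a spectrogram computation, which can be sketched with a minimal Hann-windowed STFT in NumPy. This is a simplified stand-in: MelGAN-VC uses mel spectrograms, and the FFT size, hop length, and sample rate below are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def spectrogram(wave, n_fft=512, hop=128):
    """Magnitude spectrogram via a Hann-windowed STFT.

    A minimal stand-in for the mel-spectrogram step in the abstract;
    n_fft and hop here are assumed defaults, not the paper's values.
    """
    window = np.hanning(n_fft)
    frames = []
    for start in range(0, len(wave) - n_fft + 1, hop):
        frame = wave[start:start + n_fft] * window
        frames.append(np.abs(np.fft.rfft(frame)))
    # shape: (n_fft // 2 + 1 frequency bins, n_frames)
    return np.stack(frames, axis=1)

# toy usage: one second of a 440 Hz tone at an assumed 16 kHz rate
sr = 16000
t = np.arange(sr) / sr
spec = spectrogram(np.sin(2 * np.pi * 440 * t))
print(spec.shape)
```

The resulting 2D array is what a GAN generator would translate from the source-voice domain to the target-voice domain; a vocoder or phase-reconstruction step is still needed to return to a waveform.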


An Adaptive Learning based Generative Adversarial Network for One-To-One Voice Conversion
An adaptive learning-based GAN model called ALGAN-VC is proposed for efficient one-to-one voice conversion between speakers; it performs the conversion task well, achieving high speaker similarity and adequate speech quality.
Improved Speech Synthesis using Generative Adversarial Networks
Mel-Spectrogram GAN (MSGAN) is proposed, which instead uses the mel spectrogram of the audio signal to approximate the human auditory response more closely than narrow frequency bands; results suggest that the conditional MSGAN architecture is a promising approach to improved speech synthesis with GANs.
Attention-Guided Generative Adversarial Network for Whisper to Normal Speech Conversion
Experimental results demonstrate that the proposed AGANW2SC can obtain improved speech quality and intelligibility compared with dynamic-time-warping-based methods.
Audio style conversion using deep learning
This research has broader applications, including converting music from one genre to another, identifying synthetic voices, and curating voices for AI assistants based on user preference.
RefineGAN: Universally Generating Waveform Better than Ground Truth with Highly Accurate Pitch and Intensity Responses
RefineGAN is proposed, a high-fidelity neural vocoder with faster-than-real-time generation, focused on robustness, pitch and intensity accuracy, and full-band audio generation.
Voice Aging with Audio-Visual Style Transfer
This work classifies a speaker's age by training a convolutional neural network on voice and face data from the Common Voice and VoxCeleb datasets, and uses style transfer to transform an input spectrogram into voices of various ages.
Neural Style Transfer Based Voice Mimicking for Personalized Audio Stories
A CNN-based neural style transfer approach on audio data that personalizes storytelling: users record a few sentences, which are used to mimic their voice, bridging storytelling and screen time and engaging children through the implicit ethical themes of the stories.
Adversarial representation learning for private speech generation
A model based on generative adversarial networks (GANs) is presented that learns to obfuscate specific sensitive attributes in speech data while preserving the meaning of the utterance.
Unsupervised Musical Timbre Transfer for Notification Sounds
A method to transform artificial notification sounds into various musical timbres by adapting the problem for a cycle-consistent generative adversarial network and training it with unpaired samples from the source and the target domains.

References

StarGAN-VC: non-parallel many-to-many Voice Conversion Using Star Generative Adversarial Networks
Subjective evaluation experiments on a non-parallel many-to-many speaker identity conversion task revealed that the proposed method obtained higher sound quality and speaker similarity than a state-of-the-art method based on variational autoencoding GANs.
Evaluation of Expressive Speech Synthesis With Voice Conversion and Copy Resynthesis Techniques
  • O. Türk, M. Schröder
  • Computer Science
    IEEE Transactions on Audio, Speech, and Language Processing
  • 2010
The results show that there is a tradeoff between identification and naturalness: combined modeling of both voice quality and prosody leads to the best identification scores at the expense of the lowest naturalness ratings.
Adversarial Audio Synthesis
WaveGAN is a first attempt at applying GANs to unsupervised synthesis of raw-waveform audio, capable of synthesizing one second slices of audio waveforms with global coherence, suitable for sound effect generation.
CycleGAN-VC: Non-parallel Voice Conversion Using Cycle-Consistent Adversarial Networks
A non-parallel voice conversion (VC) method is proposed that learns a mapping from source to target speech without relying on parallel data; the method is general-purpose, produces high-quality speech, and requires no extra data, modules, or alignment procedure.
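The cycle-consistency idea that lets CycleGAN-VC train without parallel utterances can be sketched numerically: mapping a sample to the target domain and back should recover the original. The linear "generators" below are hypothetical stand-ins (the paper's generators are gated CNNs), chosen here as exact inverses so the cycle term vanishes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the two generators G_X->Y and G_Y->X;
# CycleGAN-VC's real generators are gated CNNs, not linear maps.
W_xy = np.eye(4) + 0.1 * rng.standard_normal((4, 4))
W_yx = np.linalg.inv(W_xy)  # exact inverse, so the cycle loss is ~0

def g_x2y(x):
    return x @ W_xy

def g_y2x(y):
    return y @ W_yx

def cycle_consistency_loss(x):
    """L1 distance between x and G_Y->X(G_X->Y(x)) -- the term that
    constrains training when no parallel source/target pairs exist."""
    return np.mean(np.abs(g_y2x(g_x2y(x)) - x))

x = rng.standard_normal((8, 4))  # a batch of 8 feature frames
print(cycle_consistency_loss(x))  # ~0, since W_yx inverts W_xy
```

In actual training this loss is added to the adversarial losses of both domain discriminators, so the generators must both fool the discriminators and remain invertible on real data.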
Sequence-to-Sequence Voice Conversion with Similarity Metric Learned Using Generative Adversarial Networks
The proposed SVC framework uses a similarity metric implicitly derived from a generative adversarial network, enabling distance measurement in a high-level abstract space and mitigating the oversmoothing problem that arises in the low-level data space.
Synthesizing Audio with Generative Adversarial Networks
WaveGAN is introduced, a first attempt at applying GANs to raw audio synthesis in an unsupervised setting; human judges prefer the generated examples from WaveGAN over those from a method that naively applies GANs to image-like audio feature representations.
Cyclegan-VC2: Improved Cyclegan-based Non-parallel Voice Conversion
CycleGAN-VC2 is proposed, an improved version of CycleGAN-VC incorporating three new techniques: an improved objective (two-step adversarial losses), an improved generator (2-1-2D CNN), and an improved discriminator (PatchGAN).
Voice Conversion Based on Speaker-Dependent Restricted Boltzmann Machines
This paper presents a voice conversion technique using speaker-dependent Restricted Boltzmann Machines (RBM) to build high-order eigen spaces of source/target speakers, where conversion from source to target is easier to perform.
Spectral voice conversion for text-to-speech synthesis
  • A. Kain, Michael W. Macon
  • Computer Science
    Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP '98 (Cat. No.98CH36181)
  • 1998
A new voice conversion algorithm that modifies a source speaker's speech to sound as if produced by a target speaker is presented and is found to perform more reliably for small training sets than a previous approach.
Mapping frames with DNN-HMM recognizer for non-parallel voice conversion
  • M. Dong, Chenyu Yang, Haizhou Li
  • Computer Science
    2015 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA)
  • 2015
A DNN-HMM recognizer is used to recognize each frame of both source and target speech signals, producing conversion results comparable to parallel voice conversion.