Corpus ID: 235446688

Global Rhythm Style Transfer Without Text Transcriptions

Kaizhi Qian, Yang Zhang, Shiyu Chang, Jinjun Xiong, Chuang Gan, David Cox, Mark A. Hasegawa-Johnson

Prosody plays an important role in characterizing the style of a speaker or an emotion, but most non-parallel voice or emotion style transfer algorithms do not convert any prosody information. Two major components of prosody are pitch and rhythm. Disentangling the prosody information, particularly the rhythm component, from the speech is challenging because it involves breaking the synchrony between the input speech and the disentangled speech representation. As a result, most existing prosody… 

A Simple Feature Method for Prosody Rhythm Comparison

Of all the components of prosody, rhythm has been regarded as the hardest to address, as it is closely linked to pitch and intensity. Nevertheless, rhythm is a very good indicator of a speaker’s fluency.

Textless Speech Emotion Conversion using Decomposed and Discrete Representations

This study decomposes speech into discrete and disentangled learned representations, consisting of content units, F0, speaker, and emotion, and concludes with an extensive model analysis and ablation study to better emphasize the architectural choices, strengths and weaknesses of the proposed method.

Investigation into Target Speaking Rate Adaptation for Voice Conversion

This work employs an explicit and fully unsupervised disentanglement approach, previously used only for representation learning, which yields both superior voice conversion and content reconstruction, and shows that the proposed adaptation increases speaking rate similarity with respect to the target speaker.

Textless Speech Emotion Conversion using Discrete and Decomposed Representations

This study uses a decomposition of the speech signal into discrete learned representations, consisting of phonetic-content units, prosodic features, speaker, and emotion, to solve the problem of emotion conversion as a spoken language translation task.

ControlVC: Zero-Shot Voice Conversion with Time-Varying Controls on Pitch and Rhythm

Experimental results show that ControlVC realizes a good level of time-varying controllability on pitch, while achieving significantly better naturalness and timbre similarity than the comparison methods.

Enhancing Zero-Shot Many to Many Voice Conversion via Self-Attention VAE with Structurally Regularized Layers

This work identifies a suitable location in the VAE's decoder for adding a self-attention layer, which incorporates non-local information when generating a converted utterance and hides the source speaker’s identity, and applies the relaxed group-wise splitting method (RGSM) to regularize network weights and markedly improve generalization performance.

Enhanced exemplar autoencoder with cycle consistency loss in any-to-one voice conversion

This work proposes a simple yet effective approach based on a cycle consistency loss to train eAEs of multiple speakers with a shared encoder, encouraging the speech reconstructed by any speaker-specific decoder to yield a latent code consistent with that of the original speech when cycled back and encoded again.

Subband-based Generative Adversarial Network for Non-parallel Many-to-many Voice Conversion

This paper proposes a new voice conversion framework, i.e. Subband-based Generative Adversarial Network for Voice Conversion (SGAN-VC), which converts each subband content of the source speech separately by explicitly utilizing the spatial characteristics between different subbands.

CycleFlow: Purify Information Factors by Cycle Loss

A CycleFlow model is proposed that combines random factor substitution and cycle consistency loss to solve the problem of speech factorization in SpeechFlow and shows that the novel approach enforces independent information codes without sacrificing reconstruction loss.

MetaSpeech: Speech Effects Switch Along with Environment for Metaverse

Experimental results on the public LJSpeech dataset with four environment effects show that the proposed model completes the environment effect conversion and outperforms baseline methods from the voice conversion task.



Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis

"Global style tokens" (GSTs), a bank of embeddings that are jointly trained within Tacotron, a state-of-the-art end-to-end speech synthesis system, learn to factorize noise and speaker identity, providing a path towards highly scalable but robust speech synthesis.

Unsupervised Speech Decomposition via Triple Information Bottleneck

SpeechSplit is among the first algorithms that can separately perform style transfer on timbre, pitch and rhythm without text labels and can blindly decompose speech into its four components by introducing three carefully designed information bottlenecks.

CHiVE: Varying Prosody in Speech Synthesis with a Linguistically Driven Dynamic Hierarchical Conditional Variational Network

It is shown that the dynamic hierarchical network outperforms a non-hierarchical state-of-the-art baseline, and, additionally, that prosody transfer across sentences is possible by employing the prosody embedding of one sentence to generate the speech signal of another.

Mellotron: Multispeaker Expressive Voice Synthesis by Conditioning on Rhythm, Pitch and Global Style Tokens

A multispeaker voice synthesis model based on Tacotron 2 GST that can make a voice emote and sing without emotive or singing training data, and synthesized samples that include style transfer from other speakers, singers and styles not seen during training, procedural manipulation of rhythm and pitch and choir synthesis.

F0-Consistent Many-To-Many Non-Parallel Voice Conversion Via Conditional Autoencoder

This work modified and improved autoencoder-based voice conversion to disentangle content, F0, and speaker identity at the same time and can control the F0 contour, generate speech with F0 consistent with the target speaker, and significantly improve quality and similarity.

Prosody conversion from neutral speech to emotional speech

The results support the use of a neutral semantic content text in databases for emotional speech synthesis by using "strong", "medium", and "weak" classifications.

Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron

An extension to the Tacotron speech synthesis architecture that learns a latent embedding space of prosody, derived from a reference acoustic representation containing the desired prosody, results in synthesized audio that matches the prosody of the reference signal with fine time detail.

VAW-GAN for Disentanglement and Recomposition of Emotional Elements in Speech

This paper proposes a speaker-dependent EVC framework based on VAW-GAN that includes a spectral encoder, which disentangles emotion and prosody (F0) information from spectral features, and a prosodic encoder, which disentangles emotion modulation of prosody from linguistic prosody.

Transforming Spectrum and Prosody for Emotional Voice Conversion with Non-Parallel Training Data

A CycleGAN network is proposed to find an optimal pseudo pair from non-parallel training data by learning forward and inverse mappings simultaneously using adversarial and cycle-consistency losses. Experimental results show that the proposed framework outperforms the baselines in both objective and subjective evaluations.

Unsupervised Singing Voice Conversion

Evidence that the conversion produces natural singing voices that are highly recognizable as the target singer is presented, as well as new training losses and protocols that are based on backtranslation.