Global Rhythm Style Transfer Without Text Transcriptions
@article{Qian2021GlobalRS,
  title={Global Rhythm Style Transfer Without Text Transcriptions},
  author={Kaizhi Qian and Yang Zhang and Shiyu Chang and Jinjun Xiong and Chuang Gan and David Cox and Mark A. Hasegawa-Johnson},
  journal={ArXiv},
  year={2021},
  volume={abs/2106.08519}
}
Prosody plays an important role in characterizing the style of a speaker or an emotion, but most non-parallel voice or emotion style transfer algorithms do not convert any prosody information. Two major components of prosody are pitch and rhythm. Disentangling the prosody information, particularly the rhythm component, from the speech is challenging because it involves breaking the synchrony between the input speech and the disentangled speech representation. As a result, most existing prosody…
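The abstract describes breaking the synchrony between the input speech and the disentangled representation. One known way to force such a rhythm bottleneck (used in related work such as SpeechSplit) is to randomly stretch or compress segments of the feature sequence along time. The sketch below is only an illustration of that idea with numpy; the segment length, stretch range, and nearest-neighbor interpolation are arbitrary choices, not the paper's actual method.

```python
import numpy as np

def random_resample(features, min_stretch=0.5, max_stretch=1.5, seg_len=20, seed=0):
    """Randomly stretch/compress fixed-length segments of a (T, D) feature
    sequence. This destroys frame-level synchrony between the input and the
    resampled representation while preserving segment order, so a downstream
    model cannot rely on absolute timing (i.e. rhythm) of the input."""
    rng = np.random.default_rng(seed)
    segments = [features[i:i + seg_len] for i in range(0, len(features), seg_len)]
    out = []
    for seg in segments:
        factor = rng.uniform(min_stretch, max_stretch)
        new_len = max(1, int(round(len(seg) * factor)))
        # nearest-neighbor resampling along the time axis
        idx = np.linspace(0, len(seg) - 1, new_len).round().astype(int)
        out.append(seg[idx])
    return np.concatenate(out, axis=0)

mel = np.random.default_rng(1).normal(size=(100, 80))  # dummy mel-spectrogram
resampled = random_resample(mel)
print(mel.shape, resampled.shape)  # time axis length changes, feature dim does not
```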
11 Citations
A Simple Feature Method for Prosody Rhythm Comparison
- Physics
- 2022
Of all the components of Prosody, Rhythm has been regarded as the hardest to address, as it is closely linked to Pitch and Intensity. Nevertheless, Rhythm is a very good indicator of a speaker's fluency…
Textless Speech Emotion Conversion using Decomposed and Discrete Representations
- Computer Science, ArXiv
- 2021
This study decomposes speech into discrete and disentangled learned representations, consisting of content units, F0, speaker, and emotion, and concludes with an extensive model analysis and ablation study to better emphasize the architectural choices, strengths and weaknesses of the proposed method.
Investigation into Target Speaking Rate Adaptation for Voice Conversion
- Computer Science, INTERSPEECH
- 2022
This work employs an explicit, fully unsupervised disentanglement approach that has previously been used only for representation learning, obtaining both superior voice conversion and content reconstruction, and shows that the proposed adaptation increases speaking-rate similarity to the target speaker.
Textless Speech Emotion Conversion using Discrete and Decomposed Representations
- Computer Science
- 2021
This study uses a decomposition of the speech signal into discrete learned representations, consisting of phonetic-content units, prosodic features, speaker, and emotion, to solve the problem of emotion conversion as a spoken language translation task.
ControlVC: Zero-Shot Voice Conversion with Time-Varying Controls on Pitch and Rhythm
- Computer Science, ArXiv
- 2022
Experimental results show that ControlVC realizes a good level of time-varying controllability on pitch, while achieving significantly better naturalness and timbre similarity than the comparison methods.
Enhancing Zero-Shot Many to Many Voice Conversion via Self-Attention VAE with Structurally Regularized Layers
- Computer Science
- 2022
This work identifies a suitable location in the VAE's decoder to add a self-attention layer that incorporates non-local information when generating a converted utterance and hiding the source speaker's identity, and applies the relaxed group-wise splitting method (RGSM) to regularize network weights, markedly enhancing generalization performance.
Enhanced exemplar autoencoder with cycle consistency loss in any-to-one voice conversion
- Computer Science, ArXiv
- 2022
This work proposes a simple yet effective approach based on a cycle-consistency loss to train exemplar autoencoders (eAEs) of multiple speakers with a shared encoder, encouraging speech reconstructed from any speaker-specific decoder to yield a latent code consistent with the original speech when cycled back and encoded again.
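The summary above places a cycle-consistency loss on latent codes: encode, decode through a speaker-specific decoder, re-encode, and penalize the distance between the two codes. The toy below shows only where that loss term sits in the computation; the linear "encoder" and "decoder" are stand-ins for the paper's neural networks, and the variable names are illustrative.

```python
import numpy as np

# Toy linear stand-ins for a shared encoder and one speaker-specific decoder;
# the real models are neural networks trained jointly on multiple speakers.
W = np.array([[0.8, 0.1],
              [0.2, 0.9]])

def encode(x):
    """Shared encoder: maps speech features to a latent code."""
    return x @ W

def decode(z):
    """One speaker-specific decoder (exact inverse here, for illustration)."""
    return z @ np.linalg.inv(W)

x = np.array([1.0, -2.0])          # dummy input features
z = encode(x)                      # latent code of the original speech
x_hat = decode(z)                  # reconstruction via a speaker decoder
z_cycle = encode(x_hat)            # cycle the reconstruction back through the encoder
cycle_loss = np.mean((z - z_cycle) ** 2)  # penalized during training
print(cycle_loss)
```

With an exact inverse decoder the loss is (numerically) zero; during training it is nonzero and pushes the codes from any decoder's output to agree with the original code.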
Subband-based Generative Adversarial Network for Non-parallel Many-to-many Voice Conversion
- Computer Science, ArXiv
- 2022
This paper proposes a new voice conversion framework, i.e. Subband-based Generative Adversarial Network for Voice Conversion (SGAN-VC), which converts each subband content of the source speech separately by explicitly utilizing the spatial characteristics between different subbands.
CycleFlow: Purify Information Factors by Cycle Loss
- Computer Science, Odyssey
- 2022
A CycleFlow model is proposed that combines random factor substitution and cycle consistency loss to solve the problem of speech factorization in SpeechFlow and shows that the novel approach enforces independent information codes without sacrificing reconstruction loss.
MetaSpeech: Speech Effects Switch Along with Environment for Metaverse
- Computer Science, ArXiv
- 2022
Experimental results on the public LJSpeech dataset with four environment effects show that the proposed model completes environment-effect conversion and outperforms baseline methods from the voice conversion task.
References
Showing 1-10 of 45 references
Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis
- Computer Science, ICML
- 2018
This work proposes "global style tokens" (GSTs), a bank of embeddings jointly trained within Tacotron, a state-of-the-art end-to-end speech synthesis system; the tokens learn to factorize noise and speaker identity, providing a path towards highly scalable yet robust speech synthesis.
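Mechanically, a GST layer attends over the token bank with a query derived from a reference encoder, and the attention-weighted sum of tokens is the global style embedding. The numpy sketch below shows only that attention step under assumed sizes (10 tokens of dimension 128, dot-product attention); the real model uses multi-head attention and learned projections.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def gst_style_embedding(reference_encoding, token_bank):
    """Attend over a bank of learned 'style tokens': attention weights over
    the bank summarize the reference utterance's style, and their weighted
    sum is the global style embedding conditioning the synthesizer."""
    scores = token_bank @ reference_encoding              # (K,) similarity scores
    weights = softmax(scores / np.sqrt(len(reference_encoding)))
    return weights @ token_bank                           # (D,) style embedding

rng = np.random.default_rng(0)
tokens = rng.normal(size=(10, 128))   # K=10 tokens, D=128 (illustrative sizes)
ref = rng.normal(size=128)            # pooled reference-encoder output
style = gst_style_embedding(ref, tokens)
print(style.shape)
```

Because the embedding is a convex combination of a small fixed bank, style can also be controlled at inference time by setting the weights by hand instead of computing them from a reference.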
Unsupervised Speech Decomposition via Triple Information Bottleneck
- Computer Science, ICML
- 2020
SpeechSplit is among the first algorithms that can separately perform style transfer on timbre, pitch and rhythm without text labels and can blindly decompose speech into its four components by introducing three carefully designed information bottlenecks.
CHiVE: Varying Prosody in Speech Synthesis with a Linguistically Driven Dynamic Hierarchical Conditional Variational Network
- Computer Science, ICML
- 2019
It is shown that the dynamic hierarchical network outperforms a non-hierarchical state-of-the-art baseline, and, additionally, that prosody transfer across sentences is possible by employing the prosody embedding of one sentence to generate the speech signal of another.
Mellotron: Multispeaker Expressive Voice Synthesis by Conditioning on Rhythm, Pitch and Global Style Tokens
- Computer Science, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2020
This work presents a multispeaker voice synthesis model based on Tacotron 2 GST that can make a voice emote and sing without emotive or singing training data, and synthesizes samples that include style transfer from speakers, singers, and styles not seen during training, procedural manipulation of rhythm and pitch, and choir synthesis.
F0-Consistent Many-To-Many Non-Parallel Voice Conversion Via Conditional Autoencoder
- Computer Science, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2020
This work modifies and improves autoencoder-based voice conversion to disentangle content, F0, and speaker identity simultaneously; it can control the F0 contour, generate speech with F0 consistent with the target speaker, and significantly improve quality and similarity.
Prosody conversion from neutral speech to emotional speech
- Computer Science, IEEE Transactions on Audio, Speech, and Language Processing
- 2006
The results support the use of text with neutral semantic content in databases for emotional speech synthesis, using "strong", "medium", and "weak" classifications.
Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron
- Physics, ICML
- 2018
This work presents an extension to the Tacotron speech synthesis architecture that learns a latent embedding space of prosody derived from a reference acoustic representation; conditioning on the desired prosody results in synthesized audio that matches the prosody of the reference signal with fine time detail.
Vaw-Gan For Disentanglement And Recomposition Of Emotional Elements In Speech
- Computer Science, 2021 IEEE Spoken Language Technology Workshop (SLT)
- 2021
This paper proposes a speaker-dependent EVC framework based on VAW-GAN that includes a spectral encoder, which disentangles emotion and prosody (F0) information from spectral features, and a prosodic encoder, which disentangles the emotional modulation of prosody from linguistic prosody.
Transforming Spectrum and Prosody for Emotional Voice Conversion with Non-Parallel Training Data
- Computer Science, Odyssey
- 2020
A CycleGAN network is proposed to find an optimal pseudo pair from non-parallel training data by learning forward and inverse mappings simultaneously with adversarial and cycle-consistency losses; experimental results show that the proposed framework outperforms the baselines in both objective and subjective evaluations.
Unsupervised Singing Voice Conversion
- Computer Science, INTERSPEECH
- 2019
Evidence is presented that the conversion produces natural singing voices that are highly recognizable as the target singer, along with new training losses and protocols based on backtranslation.