Corpus ID: 238857242

Improve Cross-lingual Voice Cloning Using Low-quality Code-switched Data

Haitong Zhang and Yue Lin
Recently, sequence-to-sequence (seq-to-seq) models have been successfully applied in text-to-speech (TTS) to synthesize speech for single-language text. Synthesizing speech in multiple languages usually requires multi-lingual speech recorded from the target speaker, yet collecting high-quality multi-lingual TTS data for target speakers is both laborious and expensive. In this paper, we propose to use low-quality code-switched found data from non-target speakers to achieve cross…
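The abstract describes decoupling voice (speaker) from language so that data from non-target speakers can teach the model a second language. A common recipe for this, which the paper's setup presumably builds on, is to condition the seq-to-seq encoder outputs on separate speaker and language embeddings. The following is a minimal illustrative sketch of that conditioning step only, not the paper's actual implementation; all names, table contents, and dimensions are assumptions.

```python
# Sketch: conditioning seq-to-seq encoder outputs on separate speaker and
# language embeddings, so the decoder can vary voice and language
# independently (e.g. target speaker's voice, non-target language).
# Embedding values here are random stand-ins for learned parameters.
import random

random.seed(0)
EMB_DIM = 4  # assumed embedding size, for illustration only


def make_embedding_table(ids, dim=EMB_DIM):
    """Build a lookup table mapping each id to a dense vector."""
    return {i: [random.uniform(-1, 1) for _ in range(dim)] for i in ids}


# Hypothetical inventory: one target speaker plus a speaker from the
# low-quality code-switched "found" data, and two languages.
speaker_table = make_embedding_table(["target_spk", "found_spk_1"])
language_table = make_embedding_table(["en", "zh"])


def condition_encoder_outputs(encoder_outputs, speaker_id, language_id):
    """Concatenate speaker and language embeddings to every encoder frame."""
    spk = speaker_table[speaker_id]
    lang = language_table[language_id]
    return [frame + spk + lang for frame in encoder_outputs]


# Toy "encoder outputs": 3 frames of 2-dim text features.
enc = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
# Ask for the target speaker's voice in a language they never recorded.
conditioned = condition_encoder_outputs(enc, "target_spk", "zh")
```

Because the speaker and language codes are independent inputs, any (speaker, language) pair can be requested at synthesis time; the training challenge the abstract points at is making this work when the second-language data is low-quality and comes from other speakers.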

Experiments with Cross-lingual Systems for Synthesis of Code-Mixed Text
This work proposes a framework to synthesize code-mixed text using a TTS database in a single language, identifying the language each word comes from, normalizing spellings of a language written in a non-standardized script, and mapping the phonetic space of the mixed language to the language the TTS database was recorded in.
End-to-end Code-switched TTS with Mix of Monolingual Recordings
  • Yuewen Cao, Xixin Wu, +5 authors H. Meng
    ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2019
The proposed E2E TTS systems can generate controllable foreign-accented speech at character-level using only mixture of monolingual training data and are confirmed to be effective in terms of quality and speaker similarity of the generated speech.
Learning to Speak Fluently in a Foreign Language: Multilingual Speech Synthesis and Cross-Language Voice Cloning
A multispeaker, multilingual text-to-speech (TTS) synthesis model based on Tacotron that is able to produce high quality speech in multiple languages and be able to transfer voices across languages, e.g. English and Mandarin.
One Model, Many Languages: Meta-learning for Multilingual Text-to-Speech
We introduce an approach to multilingual speech synthesis which uses the meta-learning concept of contextual parameter generation and produces natural-sounding multilingual speech using more languages and less training data than previous approaches.
Deep Voice 2: Multi-Speaker Neural Text-to-Speech
It is shown that a single neural TTS system can learn hundreds of unique voices from less than half an hour of data per speaker, while achieving high audio quality synthesis and preserving the speaker identities almost perfectly.
Building a mixed-lingual neural TTS system with only monolingual data
The problem in the encoder-decoder framework when only monolingual data from a target speaker is available is looked at from two aspects: speaker consistency within an utterance and naturalness.
Turning a Monolingual Speaker into Multilingual for a Mixed-language TTS
A new approach to rendering speech of different languages with only a speaker's monolingual recordings for mixed-code TTS applications is proposed; it synthesizes high-quality, mixed-language (Chinese-English) speech in one consistent voice, as confirmed by both objective and subjective evaluations.
Utterance Selection for Optimizing Intelligibility of TTS Voices Trained on ASR Data
It is found that, when a large amount of data is available, selecting from the corpus based on criteria such as standard deviation of f0, fast speaking rate, and hypo-articulation produces the most intelligible voices.
HMM-Based Mixed-Language (Mandarin-English) Speech Synthesis
The bilingual state-mapping is extended to monolingual speaker to perform mixed-language synthesis and results show decent intelligibility and good speech quality.
A Cross-Language State Sharing and Mapping Approach to Bilingual (Mandarin–English) TTS
A hidden Markov model (HMM)-based bilingual (Mandarin and English) text-to-speech (TTS) system is proposed that synthesizes natural speech for given bilingual text with high intelligibility and higher-quality mixed-language output.