Optimizing Voice Conversion Network with Cycle Consistency Loss of Speaker Identity

Hongqiang Du, Xiaohai Tian, Lei Xie and Haizhou Li
2021 IEEE Spoken Language Technology Workshop (SLT)
We propose a novel training scheme to optimize a voice conversion network with a speaker identity loss function. The training scheme minimizes not only the frame-level spectral loss but also a speaker identity loss. We introduce a cycle consistency loss that constrains the converted speech to maintain the same speaker identity as the reference speech at the utterance level. While the proposed training scheme is applicable to any voice conversion network, we formulate the study under the average model voice…
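The abstract describes a combined objective: a frame-level spectral loss plus an utterance-level speaker identity term computed on speaker embeddings. A minimal NumPy sketch of such a combination is shown below; the cosine-distance identity term, the loss weight `w`, and the mel/embedding shapes are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def frame_spectral_loss(converted_mel, target_mel):
    # Frame-level reconstruction term: mean absolute error
    # between converted and target mel-spectrogram frames.
    return np.mean(np.abs(converted_mel - target_mel))

def speaker_identity_loss(emb_converted, emb_reference):
    # Utterance-level identity term (assumed cosine distance):
    # 1 - cosine similarity between speaker embeddings; zero when
    # the converted speech carries the reference speaker identity.
    cos = np.dot(emb_converted, emb_reference) / (
        np.linalg.norm(emb_converted) * np.linalg.norm(emb_reference))
    return 1.0 - cos

def total_loss(converted_mel, target_mel,
               emb_converted, emb_reference, w=1.0):
    # Combined objective: frame-level spectral loss plus a weighted
    # utterance-level speaker identity (cycle consistency) term.
    # The weight w is a hypothetical hyperparameter.
    return (frame_spectral_loss(converted_mel, target_mel)
            + w * speaker_identity_loss(emb_converted, emb_reference))
```

In a real system the embeddings would come from a pretrained speaker encoder applied to the converted and reference utterances, and both terms would be differentiated through the conversion network.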


SIG-VC: A Speaker Information Guided Zero-shot Voice Conversion System for Both Human Beings and Machines
Results show that the proposed system significantly reduces the trade-off problem in zero-shot voice conversion while also achieving high spoofing power against the speaker verification system.
Noise-robust voice conversion with domain adversarial training
Deep Feature CycleGANs: Speaker Identity Preserving Non-Parallel Microphone-Telephone Domain Adaptation for Speaker Verification
CycleGAN-based unpaired translation of microphone data is explored to improve the x-vector/speaker embedding network for telephony speaker verification, and 3D-convolution-based Deep Feature Discriminators show relative improvements of 5-10% in terms of equal error rate.
Beyond view transformation: feature distribution consistent GANs for cross-view gait recognition
A feature distribution consistent generative adversarial network (FDC-GAN) is proposed to transform gait images from arbitrary views to the target view and then perform identity recognition.
Many-to-many Cross-lingual Voice Conversion with a Jointly Trained Speaker Embedding Network
This work proposes a low dimensional trainable speaker embedding network that augments the primary VC network for joint training and compares it with the i-vector scheme, finding that the proposed system effectively improves the speech quality and speaker similarity.
One-shot Voice Conversion by Separating Speaker and Content Representations with Instance Normalization
This paper proposed a novel one-shot VC approach that performs VC with only one example utterance from the source and target speaker respectively; the source and target speakers do not even need to be seen during training.
Average Modeling Approach to Voice Conversion with Non-Parallel Data
The proposed approach makes use of a multi-speaker average model that maps speaker-independent linguistic features to speaker-dependent acoustic features and does not require parallel data in either average model training or adaptation.
An overview of voice conversion systems
From Speaker Verification to Multispeaker Speech Synthesis, Deep Transfer with Feedback Constraint
A system involving a feedback constraint for multispeaker speech synthesis is presented, which enhances knowledge transfer from speaker verification to speech synthesis by engaging the speaker verification network.
One-Shot Voice Conversion with Global Speaker Embeddings
In experiments, when compared with an adaptation-training-based any-to-any VC system, the proposed GSE-based VC approach performs equally well or better in both speech naturalness and speaker similarity, with notably higher flexibility.
Voice Conversion Using GMM with Enhanced Global Variance
This work proposes a different approach for GV enhancement, based on the classical conversion formalized as a GV-constrained minimization, and shows that an improvement in quality is achieved.
Voice Conversion Across Arbitrary Speakers Based on a Single Target-Speaker Utterance
The i-vector-based VC (IVC) approach is superior to SEVC in terms of the quality of the converted speech and its similarity to utterances produced by the genuine target speaker.
A Voice Conversion Framework with Tandem Feature Sparse Representation and Speaker-Adapted WaveNet Vocoder
A novel technique for sparse representation that augments the spectral features with phonetic information, or Tandem Feature, is studied and shown to provide consistent performance improvement over the traditional sparse representation framework in objective and subjective evaluations.
Voice Conversion Using Dynamic Frequency Warping With Amplitude Scaling, for Parallel or Nonparallel Corpora
The proposed DFW with Amplitude scaling (DFWA) outperforms standard GMM and hybrid GMM-DFW methods for VC in terms of both speech quality and timbre conversion, as is confirmed in extensive objective and subjective testing.