Corpus ID: 246294735

The MSXF TTS System for ICASSP 2022 ADD Challenge

@article{Yang2022TheMT,
  title={The MSXF TTS System for ICASSP 2022 ADD Challenge},
  author={Chunyong Yang and Pengfei Liu and Yanli Chen and Hongbin Wang and Min Liu},
  journal={ArXiv},
  year={2022},
  volume={abs/2201.11400}
}
This paper presents our MSXF TTS system for Task 3.1 of the Audio Deep Synthesis Detection (ADD) Challenge 2022. We use an end-to-end text-to-speech system and add a constraint loss during the training stage. The end-to-end TTS system is VITS, and the pre-trained self-supervised model is wav2vec 2.0. We also explore the influence of speech speed and volume on spoofing: faster speech means less silence in the audio, which makes it easier to fool the detector. We also find the… 
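The abstract describes adding a constraint loss during VITS training using a pre-trained wav2vec 2.0 model, but it does not specify the exact form of that loss. Below is a minimal sketch of one plausible realization, assuming an L1 feature-matching constraint computed from a frozen wav2vec 2.0 encoder; the names vits_loss, lambda_c, and the waveform variables are illustrative and not from the paper.

```python
# Hypothetical sketch (not from the paper) of a wav2vec 2.0 based constraint loss
# added to TTS training. Assumptions: the constraint is an L1 distance between
# self-supervised features of generated and ground-truth audio, and the
# wav2vec 2.0 encoder stays frozen. Names such as vits_loss and lambda_c are
# illustrative placeholders.
import torch
import torch.nn.functional as F
from transformers import Wav2Vec2Model

w2v = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
w2v.eval()
for p in w2v.parameters():
    p.requires_grad_(False)  # use the SSL model only as a fixed feature extractor

def constraint_loss(gen_wav: torch.Tensor, ref_wav: torch.Tensor) -> torch.Tensor:
    """L1 distance between wav2vec 2.0 features of generated and reference speech."""
    gen_feat = w2v(gen_wav).last_hidden_state    # (batch, frames, dim)
    ref_feat = w2v(ref_wav).last_hidden_state
    n = min(gen_feat.size(1), ref_feat.size(1))  # align frame counts if lengths differ
    return F.l1_loss(gen_feat[:, :n], ref_feat[:, :n])

# Inside a (hypothetical) VITS training step:
# total_loss = vits_loss + lambda_c * constraint_loss(generated_wav, target_wav)
```

In this kind of setup the gradients still flow through the frozen encoder back into the TTS generator, so the constraint shapes the synthesized waveform rather than the feature extractor.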

References

Showing 1-10 of 15 references

From Speaker Verification to Multispeaker Speech Synthesis, Deep Transfer with Feedback Constraint

A system involving a feedback constraint for multispeaker speech synthesis is presented, which enhances knowledge transfer from speaker verification to speech synthesis by engaging the speaker verification network.

FastSpeech 2: Fast and High-Quality End-to-End Text to Speech

FastSpeech 2 is proposed, which addresses the issues in FastSpeech and better solves the one-to-many mapping problem in TTS by directly training the model with the ground-truth target instead of the simplified output from a teacher, and by introducing more variation information of speech as conditional inputs.

AdaSpeech: Adaptive Text to Speech for Custom Voice

AdaSpeech is proposed, an adaptive TTS system for high-quality and efficient customization of new voices; it achieves much better adaptation quality than baseline methods with only about 5K speaker-specific parameters per speaker, which demonstrates its effectiveness for custom voice.

WenetSpeech: A 10000+ Hours Multi-domain Mandarin Corpus for Speech Recognition

WenetSpeech is currently the largest open-source Mandarin speech corpus with transcriptions, benefiting research on production-level speech recognition; a novel end-to-end label error detection approach is also proposed.

wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations

We show for the first time that learning powerful representations from speech audio alone followed by fine-tuning on transcribed speech can outperform the best semi-supervised methods while being conceptually simpler.

Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech

This work presents a parallel end-to-end TTS method that generates more natural-sounding audio than current two-stage models and adopts variational inference augmented with normalizing flows and an adversarial training process, which improves the expressive power of generative modeling.

MultiSpeech: Multi-Speaker Text to Speech with Transformer

A robust and high-quality Transformer TTS system called MultiSpeech, with several specially designed components/techniques to improve text-to-speech alignment, which synthesizes more robust and better-quality multi-speaker voices than naive Transformer-based TTS.

One TTS Alignment To Rule Them All

This paper leverages the alignment mechanism proposed in RAD-TTS and improves alignment convergence speed, simplifies the training pipeline by eliminating the need for external aligners, enhances robustness to errors on long utterances, and improves the perceived speech synthesis quality, as judged by human evaluators.

What all do audio transformer models hear? Probing Acoustic Representations for Language Delivery and its Structure

This work compares two recent audio transformer models, Mockingjay and wav2vec 2.0, on a comprehensive set of language delivery and structure features, including audio, fluency, and pronunciation features, and probes their understanding of textual surface, syntax, and semantic features.

End-to-End Spectro-Temporal Graph Attention Networks for Speaker Verification Anti-Spoofing and Speech Deepfake Detection

It is shown that better performance can be achieved when the fusion is performed within the model itself and when the representation is learned automatically from raw waveform inputs.