ADEPT: A Dataset for Evaluating Prosody Transfer

@inproceedings{Torresquintero2021ADEPTAD,
  title={ADEPT: A Dataset for Evaluating Prosody Transfer},
  author={Alexandra Torresquintero and Tian Huey Teh and Christine Wallis and Marlene Staib and Devang S. Ram Mohan and Vivian Hu and Lorenzo Foglianti and Jiameng Gao and Simon King},
  booktitle={Interspeech},
  year={2021}
}
Text-to-speech is now able to achieve near-human naturalness and research focus has shifted to increasing expressivity. One popular method is to transfer the prosody from a reference speech sample. There have been considerable advances in using prosody transfer to generate more expressive speech, but the field lacks a clear definition of what successful prosody transfer means and a method for measuring it. We introduce a dataset of prosodically-varied reference natural speech samples for… 

Exact Prosody Cloning in Zero-Shot Multispeaker Text-to-Speech

It is shown that the voice of a speaker and the prosody of a spoken reference can be cloned independently, without any degradation in quality and with high similarity to both the original voice and the original prosody, as the objective evaluation and human study show.

Multi-Lingual Multi-Task Speech Emotion Recognition Using wav2vec 2.0

  • Mayank Sharma
  • Computer Science
    ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2022
A Multi-Lingual (MLi) and Multi-Task Learning (MTL) audio only SER system based on the multi-lingual pre-trained wav2vec 2.0 model that outperforms the state-of-the-art for the languages contained in the pre-training corpora.

PoeticTTS - Controllable Poetry Reading for Literary Studies

Speech synthesis for poetry is challenging due to specific intonation patterns inherent to poetic speech. In this work, we propose an approach to synthesise poems with almost human-like naturalness

References

Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron

An extension to the Tacotron speech synthesis architecture that learns a latent embedding space of prosody, derived from a reference acoustic representation containing the desired prosody results in synthesized audio that matches the prosody of the reference signal with fine time detail.

CopyCat: Many-to-Many Fine-Grained Prosody Transfer for Neural Text-to-Speech

This paper proposes CopyCat, a novel many-to-many prosody transfer (PT) system that is robust to source speaker leakage without using parallel data, through a novel reference encoder architecture capable of capturing temporal prosodic representations.

Prosody Transfer in Neural Text to Speech Using Global Pitch and Loudness Features

The main idea is to incorporate well-known acoustic correlates of prosody such as pitch and loudness contours of the reference speech into a modern neural text-to-speech (TTS) synthesizer such as Tacotron2 (TC2).

Robust and Fine-grained Prosody Control of End-to-end Speech Synthesis

  • Younggun Lee, Taesu Kim
  • Computer Science
    ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2019
The proposed methods introduce temporal structures in the embedding networks, enabling fine-grained control of the speaking style of the synthesized speech, and add temporal normalization of prosody embeddings, which shows better robustness against speaker perturbations during prosody transfer tasks.

Improving the Prosody of RNN-Based English Text-To-Speech Synthesis by Incorporating a BERT Model

It is shown that incorporating a BERT model in an RNN-based speech synthesis model, where the BERT model is pretrained on large amounts of unlabeled data and fine-tuned to the speech domain, improves prosody. The paper also proposes a way of handling arbitrarily long sequences with BERT.

Learning to Speak Fluently in a Foreign Language: Multilingual Speech Synthesis and Cross-Language Voice Cloning

A multispeaker, multilingual text-to-speech (TTS) synthesis model based on Tacotron that is able to produce high-quality speech in multiple languages and to transfer voices across languages, e.g. English and Mandarin.

Social face to face communication - American English attitudinal prosody

The recording paradigm and the perceptual evaluation of a corpus of 16 prosodic social affects performed by a set of 8 native American English speakers are presented and variations in the prosodic and facial strategies observed are described and discussed in light of Ohala’s frequency code.

A prosody tutorial for investigators of auditory sentence processing

Evidence is presented that, because syntax does not fully predict the way that spoken utterances are organized, prosody is a significant issue for studies of auditory sentence processing.

Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions

This paper describes Tacotron 2, a neural network architecture for speech synthesis directly from text. The system is composed of a recurrent sequence-to-sequence feature prediction network that maps