ADEPT: A Dataset for Evaluating Prosody Transfer
@inproceedings{Torresquintero2021ADEPTAD,
  title={ADEPT: A Dataset for Evaluating Prosody Transfer},
  author={Alexandra Torresquintero and Tian Huey Teh and Christine Wallis and Marlene Staib and Devang S. Ram Mohan and Vivian Hu and Lorenzo Foglianti and Jiameng Gao and Simon King},
  booktitle={Interspeech},
  year={2021}
}
Text-to-speech is now able to achieve near-human naturalness and research focus has shifted to increasing expressivity. One popular method is to transfer the prosody from a reference speech sample. There have been considerable advances in using prosody transfer to generate more expressive speech, but the field lacks a clear definition of what successful prosody transfer means and a method for measuring it. We introduce a dataset of prosodically-varied reference natural speech samples for…
3 Citations
Exact Prosody Cloning in Zero-Shot Multispeaker Text-to-Speech
- Physics, Computer Science
- 2022 IEEE Spoken Language Technology Workshop (SLT)
- 2023
It is shown that the voice of a speaker and the prosody of a spoken reference can be cloned independently, without any degradation in quality and with high similarity to both the original voice and the original prosody, as the objective evaluation and human study show.
Multi-Lingual Multi-Task Speech Emotion Recognition Using wav2vec 2.0
- Computer Science
- ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2022
A Multi-Lingual (MLi) and Multi-Task Learning (MTL) audio-only SER system, based on the multi-lingual pre-trained wav2vec 2.0 model, that outperforms the state-of-the-art for the languages contained in the pre-training corpora.
PoeticTTS - Controllable Poetry Reading for Literary Studies
- Psychology
- INTERSPEECH
- 2022
Speech synthesis for poetry is challenging due to specific intonation patterns inherent to poetic speech. In this work, we propose an approach to synthesise poems with almost human-like naturalness…
References
Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron
- Physics
- ICML
- 2018
An extension to the Tacotron speech synthesis architecture that learns a latent embedding space of prosody, derived from a reference acoustic representation containing the desired prosody, results in synthesized audio that matches the prosody of the reference signal with fine time detail.
CopyCat: Many-to-Many Fine-Grained Prosody Transfer for Neural Text-to-Speech
- Computer Science
- INTERSPEECH
- 2020
This paper proposes CopyCat, a novel many-to-many prosody transfer (PT) system that is robust to source speaker leakage without using parallel data, through a novel reference encoder architecture capable of capturing temporal prosodic representations.
Prosody Transfer in Neural Text to Speech Using Global Pitch and Loudness Features
- Physics
- ArXiv
- 2019
The main idea is to incorporate well-known acoustic correlates of prosody such as pitch and loudness contours of the reference speech into a modern neural text-to-speech (TTS) synthesizer such as Tacotron2 (TC2).
Robust and Fine-grained Prosody Control of End-to-end Speech Synthesis
- Computer Science
- ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2019
The proposed methods introduce temporal structures in the embedding networks, enabling fine-grained control of the speaking style of the synthesized speech, and introduce temporal normalization of prosody embeddings, which shows better robustness against speaker perturbations during prosody transfer tasks.
Improving the Prosody of RNN-Based English Text-To-Speech Synthesis by Incorporating a BERT Model
- Computer Science
- INTERSPEECH
- 2020
It is shown that incorporating a BERT model in an RNN-based speech synthesis model, where the BERT model is pretrained on large amounts of unlabeled data and fine-tuned to the speech domain, improves prosody. A way of handling arbitrarily long sequences with BERT is also proposed.
Learning to Speak Fluently in a Foreign Language: Multilingual Speech Synthesis and Cross-Language Voice Cloning
- Linguistics, Computer Science
- INTERSPEECH
- 2019
A multispeaker, multilingual text-to-speech (TTS) synthesis model based on Tacotron that is able to produce high quality speech in multiple languages and to transfer voices across languages, e.g. English and Mandarin.
Social face to face communication - American English attitudinal prosody
- Psychology
- INTERSPEECH
- 2013
The recording paradigm and the perceptual evaluation of a corpus of 16 prosodic social affects, performed by a set of 8 native American English speakers, are presented, and variations in the prosodic and facial strategies observed are described and discussed in light of Ohala’s frequency code.
A prosody tutorial for investigators of auditory sentence processing
- Psychology
- Journal of Psycholinguistic Research
- 1996
Evidence is presented that, because syntax does not fully predict the way that spoken utterances are organized, prosody is a significant issue for studies of auditory sentence processing.
Factors in the recognition of vocally expressed emotions: A comparison of four languages
- Psychology
- J. Phonetics
- 2009
Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions
- Computer Science
- 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2018
This paper describes Tacotron 2, a neural network architecture for speech synthesis directly from text. The system is composed of a recurrent sequence-to-sequence feature prediction network that maps…