Corpus ID: 235458015

EMOVIE: A Mandarin Emotion Speech Dataset with a Simple Emotional Text-to-Speech Model

Chenye Cui, Yi Ren, Jinglin Liu, Feiyang Chen, Rongjie Huang, Ming Lei, Zhou Zhao
Recently, there has been increasing interest in neural speech synthesis. While deep neural networks achieve state-of-the-art results in text-to-speech (TTS) tasks, generating more emotional and expressive speech is becoming a new challenge for researchers, due to the scarcity of high-quality emotional speech datasets and the lack of advanced emotional TTS models. In this paper, we first briefly introduce and publicly release a Mandarin emotion speech dataset including 9,724…

Controllable Emotion Transfer For End-to-End Speech Synthesis
The synthetic speech of the proposed method is more accurate and expressive, with fewer emotion category confusions, and the control of emotion strength is more salient to listeners.
LSSED: A Large-Scale Dataset and Benchmark for Speech Emotion Recognition
A challenging large-scale English speech emotion dataset, with data collected from 820 subjects to simulate real-world distribution, and some pre-trained models based on LSSED, which can not only promote the development of speech emotion recognition but can also be transferred to related downstream tasks such as mental health analysis, where data is extremely difficult to collect.
The Vera am Mittag German audio-visual emotional speech database
This contribution presents a recently collected database of spontaneous emotional speech in German which is being made available to the research community and provides emotion labels for a great part of the data.
Fine-Grained Emotion Strength Transfer, Control and Prediction for Emotional Speech Synthesis
A unified model to conduct emotion transfer, control and prediction for sequence-to-sequence based fine-grained emotional speech synthesis, which can also predict phoneme-level emotion expressions from text without requiring any reference audio or manual labels.
Using VAEs and Normalizing Flows for One-Shot Text-To-Speech Synthesis of Expressive Speech
The proposed Text-to-Speech method to create an unseen expressive style using one utterance of expressive speech of around one second provides a 22% KL-divergence reduction while jointly improving perceptual metrics over state-of-the-art.
MELD: A Multimodal Multi-Party Dataset for Emotion Recognition in Conversations
The Multimodal EmotionLines Dataset (MELD), an extension and enhancement of EmotionLines, contains about 13,000 utterances from 1,433 dialogues from the TV series Friends and shows the importance of contextual and multimodal information for emotion recognition in conversations.
The PRIORI Emotion Dataset: Linking Mood to Emotion Detected In-the-Wild
Critical steps in developing this pipeline are presented, including a new in-the-wild emotion dataset, the PRIORI Emotion Dataset, collected from everyday smartphone conversational speech recordings, and activation/valence emotion recognition baselines on this dataset provide evidence and a working baseline for the use of emotion as a meta-feature for mood state monitoring.
CREMA-D: Crowd-Sourced Emotional Multimodal Actors Dataset
An audio-visual dataset uniquely suited for the study of multi-modal emotion expression and perception, which consists of facial and vocal emotional expressions in sentences spoken in a range of basic emotional states and can be used to probe other questions concerning the audio-visual perception of emotion.
Predicting Expressive Speaking Style from Text in End-To-End Speech Synthesis
This work introduces the Text-Predicting Global Style Token (TP-GST) architecture, which treats GST combination weights or style embeddings as “virtual” speaking style labels within Tacotron, and shows that the system can render text with more pitch and energy variation than two state-of-the-art baseline models.
Almost Unsupervised Text to Speech and Automatic Speech Recognition
An almost unsupervised learning method that leverages only a few hundred paired data samples and extra unpaired data for TTS and ASR, and achieves a 99.84% word-level intelligible rate and 2.68 MOS on the LJSpeech dataset.