Voice-transformation-based data augmentation for prosodic classification

  • Raul Fernandez, Andrew Rosenberg, Alexander Sorin, Bhuvana Ramabhadran, Ron Hoory
  • Published 1 March 2017
  • Computer Science
  • 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
In this work we explore data-augmentation techniques for improving the performance of a supervised recurrent-neural-network classifier that predicts prosodic-boundary and pitch-accent labels. The technique applies voice transformations to the training data that modify the pitch baseline and range, as well as the vocal-tract and vocal-source characteristics of the speakers, to generate further training examples. We demonstrate the validity of the approach by…
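As a rough illustration of what modifying the pitch baseline and range could look like, the sketch below operates on a per-frame F0 contour only. It is not the authors' method: the function name `transform_f0`, its parameters, the use of the mean log-F0 as the "baseline", and the convention that 0.0 marks unvoiced frames are all assumptions made for this example. The paper's transformations also cover vocal-tract and vocal-source characteristics, which are not modeled here.

```python
import math


def transform_f0(f0_hz, baseline_shift_semitones=0.0, range_scale=1.0):
    """Shift the baseline and scale the range of an F0 contour (hypothetical helper).

    f0_hz: list of per-frame F0 values in Hz; 0.0 marks an unvoiced frame.
    baseline_shift_semitones: how far to move the mean log-F0 (assumed baseline).
    range_scale: factor applied to each frame's deviation from that baseline.
    """
    voiced = [f for f in f0_hz if f > 0.0]
    if not voiced:
        return list(f0_hz)

    # Work in log2(Hz) so that shifts and scalings act on musical intervals.
    mean_log = sum(math.log2(f) for f in voiced) / len(voiced)
    shift_octaves = baseline_shift_semitones / 12.0

    out = []
    for f in f0_hz:
        if f <= 0.0:
            out.append(0.0)  # leave unvoiced frames untouched
            continue
        deviation = math.log2(f) - mean_log
        out.append(2.0 ** (mean_log + shift_octaves + range_scale * deviation))
    return out


# Each transformed copy would reuse the original frame-level labels,
# assuming the transformation leaves boundary and accent positions intact.
contour = [100.0, 0.0, 200.0]
raised = transform_f0(contour, baseline_shift_semitones=12.0)
flattened = transform_f0(contour, range_scale=0.0)
```

With a 12-semitone shift and no range scaling, every voiced frame moves up one octave; with `range_scale=0.0`, all voiced frames collapse onto the baseline.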

Transplantation of Conversational Speaking Style with Interjections in Sequence-to-Sequence Speech Synthesis
This work explores one-to-many style transfer from a dedicated single-speaker conversational corpus with style nuances and interjections, elaborates on the corpus design, and explores the feasibility of such style transfer when assisted with Voice-Conversion-based data augmentation.
Speech, Prosody, and Machines: Nine Challenges for Prosody Research
This paper poses nine challenges designed to effectively and more thoroughly integrate prosody into current speech technologies, including long-standing and contemporary concerns surrounding the availability and utility of data, gaps in linguistic theory and technological issues.
Canonical Correlation Analysis and Prediction of Perceived Rhythmic Prominences and Pitch Tones in Speech
This work analyzes relationships between perceived prosodic events and acoustic features including syllable duration and novel measures of intensity and fundamental frequency and reveals two dominant prosodic dimensions relating the acoustic features and RaP annotations.
Oral English Evaluation Algorithm Based on Fuzzy Measures and Speech Recognition Technology
  • M. Guo
  • Education
    2021 5th International Conference on Trends in Electronics and Informatics (ICOEI)
  • 2021
As an important part of English teaching, oral English evaluation plays an important role in promoting students to learn English. The establishment of a diversified oral college English evaluation
Fuzzy Feature Extraction and Recognition Model in Korean Pronunciation Practice
  • Xuelai Qiu
  • Linguistics
    2021 5th International Conference on Trends in Electronics and Informatics (ICOEI)
  • 2021
The study of Korean has become a trend among young people, and more and more young people have a strong interest in Korean. Human language communication consists of four parts: listening, speaking,
The Role of Context in Neural Pitch Accent Detection in English
A new model for pitch accent detection is proposed, inspired by the work of Stehwien et al. (2018), who presented a CNN-based model for this task, that makes greater use of context by using full utterances as input and adding an LSTM layer.
Comparing Prosodic Frameworks: Investigating the Acoustic-Symbolic Relationship in ToBI and RaP
RaP is found to be promising, showing a somewhat stronger acoustic-symbolic relationship than ToBI given a comparable amount of data, and the utility of these annotation standards to correctly prescribe the acoustics of a given utterance from their symbolic sequences is examined.
Data Augmentation Improves Recognition of Foreign Accented Speech
Speed modification is found to be a remarkably reliable data augmentation technique for improving recognition of foreign accented speech.


Discriminative training and unsupervised adaptation for labeling prosodic events with limited training data
This paper explores applying conditional random fields to automatically label major and minor break indices and pitch accents from a corpus of recorded and transcribed speech, using a large set of fully automatically extracted acoustic and linguistic features, and demonstrates the robustness of these features in a discriminative training framework as the amount of training data is reduced.
Vocal Tract Length Perturbation (VTLP) improves speech recognition
Improvements in speech recognition are suggested without increasing the number of training epochs, and it is suggested that data transformations should be an important component of training neural networks for speech, especially for data limited projects.
An instantaneous vector representation of delta pitch for speaker-change prediction in conversational dialogue systems
This work focuses on system responsiveness, aiming to mimic human-like dialogue flow control by predicting speaker changes as observed in real human-human conversations, and derives an instantaneous vector representation of pitch variation which is amenable to standard acoustic modeling techniques.
Active learning for the prediction of prosodic phrase boundaries in Chinese speech synthesis systems using conditional random fields
  • Ziping Zhao, Xirong Ma
  • Computer Science
    2015 IEEE/ACIS 16th International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing (SNPD)
  • 2015
This study presents an approach based on active learning to predict the Chinese prosodic phrase boundaries in unrestricted Chinese text and shows that for most of the cases considered, the active selection strategies for labeling the prosodic phrases boundaries are as good as or exceed the performance of random data selection.
Semi-supervised Learning for Automatic Prosodic Event Detection Using Co-training Algorithm
This paper proposes a confidence-based method to assign labels to unlabeled data and demonstrates improved results using this method compared to the widely used agreement-based method.
Data Augmentation for Deep Neural Network Acoustic Modeling
Two data augmentation approaches, vocal tract length perturbation (VTLP) and stochastic feature mapping (SFM) for deep neural network acoustic modeling based on label-preserving transformations to deal with data sparsity are investigated.
Modeling phrasing and prominence using deep recurrent learning
Bidirectional Recurrent Neural Networks are examined as a function of a state variable that accumulates information over the entire input sequence, and by stacking several layers to form a deep architecture able to extract more structure from the input features.
Exploiting active-learning strategies for annotating prosodic events with limited labeled data
  • Raul Fernandez, B. Ramabhadran
  • Computer Science, Biology
    2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2011
This work explores active learning techniques with the objective of reducing the amount of human-annotated data needed to attain a given level of performance, and shows that for most of the cases considered, active selection strategies when labeling pitch accents and prosodic boundaries are as good as or exceed the performance of random data selection.
Two-Stage Data Augmentation for Low-Resourced Speech Recognition
An analysis is presented exploring why multiple, complementary augmentation approaches to increasing the amount of training data are beneficial on low-resourced languages from the IARPA Babel program.
Language Model Data Augmentation for Keyword Spotting in Low-Resourced Training Conditions
This research extends earlier work on using machine translation (MT) and word-based recurrent neural networks to augment language model training data for keyword search in conversational Cantonese speech to study how these different methods of language model data augmentation impact speech-to-text and keyword spotting performance for the Lithuanian and Amharic languages.