Synthesizing Personalized Non-speech Vocalization from Discrete Speech Representations

Chin-Cheng Hsu
We formulated non-speech vocalization (NSV) modeling as a text-to-speech task and verified its viability. Specifically, we evaluated the phonetic expressivity of HuBERT speech units on NSVs and verified our model's ability to control speaker timbre even though only a few training utterances per speaker were available. In addition, we substantiated that heterogeneity in recording conditions is the major obstacle to NSV modeling. Finally, we discussed five improvements to our method for future research…

Human Laughter Generation using Hybrid Generative Models

Two hybrid models are suggested that combine the representation-learning capacity of a variational autoencoder (VAE) with the temporal-modelling ability of a long short-term memory RNN (LSTM) and the ability of a CNN to learn invariant features.

The ICML 2022 Expressive Vocalizations Workshop and Competition: Recognizing, Generating, and Personalizing Vocal Bursts

The ICML Expressive Vocalization (ExVo) Competition is focused on understanding and generating vocal bursts: laughs, gasps, cries, and other non-verbal vocalizations that are central to…

FastSpeech 2: Fast and High-Quality End-to-End Text to Speech

FastSpeech 2 is proposed, which addresses the issues in FastSpeech and better solves the one-to-many mapping problem in TTS by directly training the model with ground-truth target instead of the simplified output from teacher, and introducing more variation information of speech as conditional inputs.

HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units

The Hidden-Unit BERT (HuBERT) approach for self-supervised speech representation learning utilizes an offline clustering step to provide aligned target labels for a BERT-like prediction loss.
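
The offline clustering step above can be sketched roughly as follows — a k-means quantizer that maps each frame-level feature vector to a discrete unit id. Note this is an illustrative stand-in: the random features substitute for real MFCCs or transformer hidden states, and the feature dimension and cluster count here are arbitrary, not HuBERT's actual configuration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in frame-level features (frames x feature_dim); in HuBERT these
# would be MFCCs (first iteration) or hidden activations (later iterations).
rng = np.random.default_rng(0)
features = rng.normal(size=(1000, 39))

# Offline clustering: k-means assigns every frame a discrete unit id,
# which then serves as the target label for the masked-prediction loss.
kmeans = KMeans(n_clusters=100, n_init=10, random_state=0).fit(features)
units = kmeans.predict(features)  # one integer label per frame
```

These discrete unit sequences are what the NSV paper above treats as a "text"-like input for its text-to-speech formulation.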

Evaluation of HMM-based laughter synthesis

The evaluation shows that the proposed HMM-based parametric synthesis method is significantly less natural than human and copy-synthesized laughs, but significantly improves laughter synthesis naturalness compared to the state of the art.

HooliGAN: Robust, High Quality Neural Vocoding

This work introduces HooliGAN, a robust vocoder that achieves state-of-the-art results, fine-tunes very well to smaller datasets (<30 minutes of speech data), and generates audio at 2.2 MHz on GPU and 35 kHz on CPU.

When voices get emotional: A corpus of nonverbal vocalizations for research on emotion processing

A new corpus of nonverbal vocalizations consisting of 121 sounds that express four positive and four negative emotions; it is suitable for behavioral and neuroscience research and may also be used in clinical settings for the assessment of neurological and psychiatric patients.

Generating Diverse Realistic Laughter for Interactive Art

LaughGANter is developed, an approach to reproduce the diversity of human laughter using generative adversarial networks (GANs) and learns a latent space suitable for emotional analysis and novel artistic applications such as latent mixing/interpolation and emotional transfer.

RTP Payload Format for the Opus Speech and Audio Codec

This document defines the Real-time Transport Protocol payload format for packetization of Opus-encoded speech and audio data necessary to integrate the codec in the most compatible way and describes media type registrations for the RTP payload format.

GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium

This work proposes a two time-scale update rule (TTUR) for training GANs with stochastic gradient descent on arbitrary GAN loss functions and introduces the "Fréchet Inception Distance" (FID), which captures the similarity of generated images to real ones better than the Inception Score.
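
FID reduces to the Fréchet distance between two Gaussians fitted to feature embeddings (Inception activations in the original paper). A minimal sketch of that distance, assuming NumPy/SciPy and that the means and covariances are supplied by the caller:

```python
import numpy as np
from scipy import linalg

def frechet_distance(mu1, sigma1, mu2, sigma2):
    """Frechet distance between Gaussians N(mu1, sigma1) and N(mu2, sigma2):
    ||mu1 - mu2||^2 + Tr(sigma1 + sigma2 - 2 * sqrtm(sigma1 @ sigma2))."""
    diff = mu1 - mu2
    covmean = linalg.sqrtm(sigma1 @ sigma2)  # matrix square root of the product
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # discard tiny imaginary parts from numerics
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))

# Identical distributions have distance 0; shifting the mean by d adds ||d||^2.
mu, sigma = np.zeros(4), np.eye(4)
assert abs(frechet_distance(mu, sigma, mu, sigma)) < 1e-6
```

The same distance, computed over audio-embedding statistics rather than Inception features, is a common objective metric for generated vocalizations.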