Corpus ID: 237439400

Text-Free Prosody-Aware Generative Spoken Language Modeling

  • Eugene Kharitonov, Ann Lee, Adam Polyak, Yossi Adi, Jade Copet, Kushal Lakhotia, Tu-Anh Nguyen, Morgane Rivière, Abdelrahman Mohamed, Emmanuel Dupoux, Wei-Ning Hsu
  • Published 7 September 2021
  • Computer Science, Engineering
  • ArXiv
Speech pre-training has primarily demonstrated efficacy on classification tasks, while its capability of generating novel speech, similar to how GPT-2 can generate coherent paragraphs, has barely been explored. Generative Spoken Language Modeling (GSLM) (Lakhotia et al., 2021) is the only prior work addressing the generative aspects of speech pre-training; it replaces text with discovered phone-like units for language modeling and shows the ability to generate meaningful novel sentences…

Textless Speech Emotion Conversion using Decomposed and Discrete Representations
This study decomposes speech into discrete and disentangled learned representations, consisting of content units, F0, speaker, and emotion, and concludes with an extensive model analysis and ablation study to better emphasize the architectural choices, strengths and weaknesses of the proposed method.
PARP: Prune, Adjust and Re-Prune for Self-Supervised Speech Recognition
This work proposes Prune-Adjust-Re-Prune (PARP), which discovers and finetunes subnetworks for much better ASR performance, while only requiring a single downstream finetuning run, and demonstrates the computational advantage and performance gain of PARP over baseline pruning methods.


Generative Spoken Language Modeling from Raw Audio
This work introduces metrics to automatically evaluate the generated output in terms of acoustic and linguistic quality in two associated end-to-end tasks, respectively: speech resynthesis and speech generation, and will open source the evaluation stack and baseline models.
Prosody-based automatic segmentation of speech into sentences and topics
This work combines prosodic cues with word-based approaches, and evaluates performance on two speech corpora, Broadcast News and Switchboard, finding that the prosodic model achieves comparable performance with significantly less training data, and requires no hand-labeling of prosodic events.
Tacotron: Towards End-to-End Speech Synthesis
Tacotron, an end-to-end generative text-to-speech model that synthesizes speech directly from characters, is presented; it achieves a 3.82 subjective 5-scale mean opinion score on US English, outperforming a production parametric system in terms of naturalness.
Speech Resynthesis from Discrete Disentangled Self-Supervised Representations
To obtain disentangled representations, low-bitrate representations are extracted separately for speech content, prosodic information, and speaker identity, which allows speech to be resynthesized in a controllable manner from self-supervised discrete representations.
Data Augmenting Contrastive Learning of Speech Representations in the Time Domain
WavAugment, a time-domain data augmentation library, is introduced and then adapted and optimized for the specificities of CPC (raw waveform input, contrastive loss, past versus future structure); applying augmentation only to the segments from which the CPC prediction is performed is found to yield better results.
HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis
It is demonstrated that modeling the periodic patterns of audio is crucial for enhancing sample quality, and HiFi-GAN is shown to generalize to mel-spectrogram inversion for unseen speakers and to end-to-end speech synthesis.
Can Prosody Aid the Automatic Classification of Dialog Acts in Conversational Speech?
It is suggested that dialog acts (DAs) are redundantly marked in natural conversation, and that a variety of automatically extractable prosodic features could aid dialog processing in speech applications.
Unsupervised Cross-Domain Singing Voice Conversion
The proposed approach is fully-convolutional and can generate audio in real-time and significantly outperforms the baseline methods while generating convincingly better audio samples than alternative attempts.
Libri-Light: A Benchmark for ASR with Limited or No Supervision
  • Jacob Kahn, M. Rivière, +12 authors Emmanuel Dupoux
  • Computer Science, Engineering
  • ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2020
A new collection of spoken English audio suitable for training speech recognition systems under limited or no supervision is introduced. It is derived from open-source audio books from the LibriVox project and is, to the authors' knowledge, the largest freely-available corpus of speech.
Using Prosodic Features in Language Models for Meetings
Fourfold cross-validation experiments on the ICSI Meeting Corpus show that exploiting prosody for language modeling can significantly reduce perplexity and also yields marginal reductions in word error rate.