• Corpus ID: 237439400

Text-Free Prosody-Aware Generative Spoken Language Modeling

  title={Text-Free Prosody-Aware Generative Spoken Language Modeling},
  author={Eugene Kharitonov and Ann Lee and Adam Polyak and Yossi Adi and Jade Copet and Kushal Lakhotia and Tu-Anh Nguyen and Morgane Rivi{\`e}re and Abdel-rahman Mohamed and Emmanuel Dupoux and Wei-Ning Hsu},
Speech pre-training has primarily demonstrated efficacy on classification tasks, while its capability of generating novel speech, similar to how GPT-2 can generate coherent paragraphs, has barely been explored. Generative Spoken Language Modeling (GSLM) (Lakhotia et al., 2021) is the only prior work addressing the generative aspects of speech pretraining, which replaces text with discovered phone-like units for language modeling and shows the ability to generate meaningful novel sentences… 

Figures and Tables from this paper

Textless Speech-to-Speech Translation on Real Data
To the knowledge, this work is the first to establish a textless S2ST technique that can be trained with real-world data and works for multiple language pairs, and finetunes a pre-trained speech encoder with paired audios from multiple speakers and a single reference speaker to reduce the variations due to accents.
Textless Speech Emotion Conversion using Decomposed and Discrete Representations
This study decomposes speech into discrete and disentangled learned representations, consisting of content units, F0, speaker, and emotion, and concludes with an extensive model analysis and ablation study to better emphasize the architectural choices, strengths and weaknesses of the proposed method.
How BPE Affects Memorization in Transformers
It is demonstrated that the size of the subword vocabulary learned by Byte-Pair Encoding greatly affects both ability and tendency of standard Transformer models to memorize training data, even when the authors control for the number of learned parameters.
PARP: Prune, Adjust and Re-Prune for Self-Supervised Speech Recognition
This work proposes Prune-AdjustRe-Prune (PARP), which discovers and finetunes subnetworks for much better ASR performance, while only requiring a single downstream finetuning run, and demonstrates the computational advantage and performance gain of PARP over baseline pruning methods.


On Generative Spoken Language Modeling from Raw Audio
Generative Spoken Language Modeling is introduced, the task of learning the acoustic and linguistic characteristics of a language from raw audio and a set of metrics to automatically evaluate the learned representations atoustic and linguistic levels for both encoding and generation.
Prosody-based automatic segmentation of speech into sentences and topics
This work combines prosodic cues with word-based approaches, and evaluates performance on two speech corpora, Broadcast News and Switchboard, finding that the prosodic model achieves comparable performance with significantly less training data, and requires no hand-labeling of prosodic events.
Tacotron: Towards End-to-End Speech Synthesis
Tacotron is presented, an end-to-end generative text- to-speech model that synthesizes speech directly from characters that achieves a 3.82 subjective 5-scale mean opinion score on US English, outperforming a production parametric system in terms of naturalness.
Speech Resynthesis from Discrete Disentangled Self-Supervised Representations
To generate disentangled representation, low-bitrate representations are extracted for speech content, prosodic information, and speaker identity to synthesize speech in a controllable manner using self-supervised discrete representations for speech resynthesis.
Data Augmenting Contrastive Learning of Speech Representations in the Time Domain
WavAugment is intro-duce, a time-domain data augmentation library which is adapt and optimize for the specificities of CPC (raw waveform input, contrastive loss, past versus future structure), and finds that applying augmentation only to the segments from which the CPC prediction is performed yields better results.
HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis
It is demonstrated that modeling periodic patterns of an audio is crucial for enhancing sample quality and the generality of HiFi-GAN is shown to the mel-spectrogram inversion of unseen speakers and end-to-end speech synthesis.
Can Prosody Aid the Automatic Classification of Dialog Acts in Conversational Speech?
It is suggested that DAs are redundantly marked in natural conversation, and that a variety of automatically extractable prosodic features could aid dialog processing in speech applications.
Unsupervised Cross-Domain Singing Voice Conversion
The proposed approach is fully-convolutional and can generate audio in real-time and significantly outperforms the baseline methods while generating convincingly better audio samples than alternative attempts.
Libri-Light: A Benchmark for ASR with Limited or No Supervision
  • Jacob Kahn, M. Rivière, +12 authors Emmanuel Dupoux
  • Computer Science, Engineering
    ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2020
A new collection of spoken English audio suitable for training speech recognition systems under limited or no supervision, derived from open-source audio books from the LibriVox project, which is, to the authors' knowledge, the largest freely-available corpus of speech.
Using Prosodic Features in Language Models for Meetings
Fourfold cross-validation experiments on the ICSI Meeting Corpus show that exploiting prosody for language modeling can significantly reduce the perplexity, and also have marginal reductions in word error rate.