AudioLM: a Language Modeling Approach to Audio Generation

Zalán Borsos, Raphaël Marinier, Damien Vincent, Eugene Kharitonov, Olivier Pietquin, Matthew Sharifi, Olivier Teboul, David Grangier, Marco Tagliasacchi, Neil Zeghidour

We introduce AudioLM, a framework for high-quality audio generation with long-term consistency. AudioLM maps the input audio to a sequence of discrete tokens and casts audio generation as a language modeling task in this representation space. We show how existing audio tokenizers provide different trade-offs between reconstruction quality and long-term structure, and we propose a hybrid tokenization scheme to achieve both objectives. Namely, we leverage the discretized activations of a masked…
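As a toy illustration of the core idea in the abstract — mapping audio to discrete tokens and casting generation as language modeling over those tokens — here is a minimal NumPy sketch. Uniform quantization and a bigram model stand in for the paper's learned tokenizers and Transformer; the function names (`tokenize`, `bigram_counts`, `generate`) are illustrative, not from the paper:

```python
import numpy as np

def tokenize(audio, num_tokens=16):
    """Map a waveform in [-1, 1] to discrete token ids via uniform
    quantization (a stand-in for a learned neural tokenizer)."""
    bins = np.linspace(-1.0, 1.0, num_tokens + 1)[1:-1]
    return np.digitize(audio, bins)

def bigram_counts(tokens, num_tokens=16):
    """Estimate next-token probabilities from bigram counts:
    a minimal 'language model' over the audio token sequence."""
    counts = np.ones((num_tokens, num_tokens))  # Laplace smoothing
    for a, b in zip(tokens[:-1], tokens[1:]):
        counts[a, b] += 1
    return counts / counts.sum(axis=1, keepdims=True)

def generate(probs, start, length, rng):
    """Autoregressively sample a token sequence from the model."""
    seq = [start]
    for _ in range(length - 1):
        seq.append(rng.choice(len(probs), p=probs[seq[-1]]))
    return seq

rng = np.random.default_rng(0)
audio = np.sin(np.linspace(0, 40 * np.pi, 4000))  # toy "recording"
tokens = tokenize(audio)
probs = bigram_counts(tokens)
sample = generate(probs, start=int(tokens[0]), length=32, rng=rng)
```

The generated token sequence would be mapped back to a waveform by the tokenizer's decoder in a real system; the point here is only the reduction of audio generation to next-token prediction.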


Audio Language Modeling using Perceptually-Guided Discrete Representations

The quality of samples generated by the method is evaluated on AudioSet, the largest general-audio dataset to date, and is shown to be superior to that of the evaluated baseline audio encoders.

AudioGen: Textually Guided Audio Generation

This work proposes AudioGen, an auto-regressive generative model that generates audio samples conditioned on text inputs; it outperforms baselines on both objective and subjective metrics, and applies classifier-free guidance to improve adherence to the text.

On The Robustness of Self-Supervised Representations for Spoken Language Modeling

This work empirically demonstrates how current state-of-the-art speech representation models lack robustness to basic signal variations that do not alter the spoken information, and proposes an effective and efficient method to learn robust self-supervised speech representations for generative spoken language modeling.

The Potential of Neural Speech Synthesis-based Data Augmentation for Personalized Speech Enhancement

With the proposed method, it is shown that the quality of the NSS system's synthetic data matters: if it is good enough, the augmented dataset can be used to train a PSE system that outperforms the speaker-agnostic baseline.

The Chamber Ensemble Generator: Limitless High-Quality MIR Data via Generative Modeling

This work combines a generative model of notes with a structured synthesis model of chamber ensembles to build a system capable of producing unlimited amounts of realistic chorale music with rich annotations, and releases both the system and the dataset as an open-source foundation for future work in the MIR community.

Modeling Animal Vocalizations through Synthesizers

Lighter-weight models that incorporate structured modules and domain knowledge, notably DDSP, have been shown to produce high-quality musical sound, however, a lack of signal-processing knowledge may hinder users from effectively manipulating the synthesis parameters.

Language Models Understand Us, Poorly

Some claim language models understand us. Others won't hear it. To clarify, I investigate three views of human language understanding: as-mapping, as-reliability, and as-representation (§2). I argue…

A Survey on Artificial Intelligence for Music Generation: Agents, Domains and Perspectives

To understand how AI models and algorithms generate music, and the potential applications that might appear in the future, this paper explores, analyzes, and describes the agents that take part in the music generation process: the datasets, models, interfaces, users, and the generated music.

A Theory of Unsupervised Translation Motivated by Understanding Animal Communication

This work proposes a theoretical framework for analyzing unsupervised machine translation (UMT) when no parallel data are available and when it cannot be assumed that the source and target corpora address related subject domains or possess similar linguistic structure, and instantiates and analyzes this framework with two complementary models of language.

token2vec: A Joint Self-Supervised Pre-training Framework Using Unpaired Speech and Text

Self-supervised pre-training has been successful in both text and speech processing. Speech and text offer different but complementary information. The question is whether we are able to perform a…

On Generative Spoken Language Modeling from Raw Audio

Generative Spoken Language Modeling is introduced: the task of learning the acoustic and linguistic characteristics of a language from raw audio, along with a set of metrics to automatically evaluate the learned representations at the acoustic and linguistic levels for both encoding and generation.

SoundStream: An End-to-End Neural Audio Codec

A novel neural audio codec is presented that can efficiently compress speech, music, and general audio at bitrates normally targeted by speech-tailored codecs, and can perform joint compression and enhancement at either the encoder or the decoder side with no additional latency.

Enabling Factorized Piano Music Modeling and Generation with the MAESTRO Dataset

By using notes as an intermediate representation, a suite of models capable of transcribing, composing, and synthesizing audio waveforms with coherent musical structure on timescales spanning six orders of magnitude is trained, a process the authors call Wave2Midi2Wave.

WaveNet: A Generative Model for Raw Audio

WaveNet, a deep neural network for generating raw audio waveforms, is introduced; it is shown that it can be efficiently trained on data with tens of thousands of samples per second of audio, and can be employed as a discriminative model, returning promising results for phoneme recognition.

Are Discrete Units Necessary for Spoken Language Modeling?

This work shows that discretization is indeed essential for good results in spoken language modeling: it removes linguistically irrelevant information from the continuous features, helping to improve language modeling performance.

Text-Free Prosody-Aware Generative Spoken Language Modeling

Experimental results show that the pGSLM can utilize prosody to improve both prosody and content modeling, and also generate natural, meaningful, and coherent speech given a spoken prompt.

High Fidelity Speech Synthesis with Adversarial Networks

GAN-TTS is capable of generating high-fidelity speech with naturalness comparable to the state-of-the-art models, and unlike autoregressive models, it is highly parallelisable thanks to an efficient feed-forward generator.

Efficient Neural Audio Synthesis

The WaveRNN, a single-layer recurrent neural network with a dual softmax layer, matches the quality of the state-of-the-art WaveNet model; a new generation scheme based on subscaling folds a long sequence into a batch of shorter sequences, allowing multiple samples to be generated at once.

The Zero Resource Speech Challenge 2021: Spoken Language Modelling

This work provides a pipeline baseline system consisting of an encoder based on contrastive predictive coding (CPC), a quantizer, and a standard language model (BERT or LSTM), and discusses the main results.

HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis

It is demonstrated that modeling the periodic patterns of audio is crucial for enhancing sample quality, and the generality of HiFi-GAN is shown through mel-spectrogram inversion of unseen speakers and end-to-end speech synthesis.