AudioGen: Textually Guided Audio Generation

  title={AudioGen: Textually Guided Audio Generation},
  author={Felix Kreuk and Gabriel Synnaeve and Adam Polyak and Uriel Singer and Alexandre Défossez and Jade Copet and Devi Parikh and Yaniv Taigman and Yossi Adi},
We tackle the problem of generating audio samples conditioned on descriptive text captions. In this work, we propose AudioGen, an auto-regressive generative model that generates audio samples conditioned on text inputs. AudioGen operates on a learnt discrete audio representation. The task of text-to-audio generation poses multiple challenges. Due to the way audio travels through a medium, differentiating "objects" can be a difficult task (e.g., separating multiple people simultaneously…
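The abstract describes text-conditioned autoregressive generation over a learnt discrete audio representation. A minimal sketch of that sampling loop is below; the codebook size, `text_embedding`, and `next_token_logits` stand-in are illustrative assumptions, not the paper's actual API (in practice the logits would come from a Transformer decoder, and a neural codec would map tokens back to a waveform).

```python
import numpy as np

VOCAB_SIZE = 1024  # size of the learnt discrete audio codebook (assumed)

def next_token_logits(text_embedding, audio_tokens):
    """Stand-in for a decoder step: scores each codebook entry given
    the text conditioning and the audio tokens generated so far."""
    rng = np.random.default_rng(len(audio_tokens))
    # Toy scores; a real model would attend over text and past tokens.
    return text_embedding.mean() * 0 + rng.standard_normal(VOCAB_SIZE)

def generate(text_embedding, num_steps=8, temperature=1.0, seed=0):
    """Sample discrete audio tokens one at a time (autoregressively)."""
    tokens = []
    rng = np.random.default_rng(seed)
    for _ in range(num_steps):
        logits = next_token_logits(text_embedding, tokens) / temperature
        probs = np.exp(logits - logits.max())  # softmax, numerically stable
        probs /= probs.sum()
        tokens.append(int(rng.choice(VOCAB_SIZE, p=probs)))
    return tokens

tokens = generate(np.ones(16), num_steps=8)
```

The key property the sketch captures is that each token's distribution depends on the text conditioning and on all previously sampled tokens; everything else (architecture, codec, vocabulary) is abstracted away.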

Audio Language Modeling using Perceptually-Guided Discrete Representations

The quality of samples generated by the method is evaluated on AudioSet, the largest general-audio dataset to date, and is shown to be superior to that of the evaluated baseline audio encoders.

I Hear Your True Colors: Image Guided Audio Generation

An out-of-domain image dataset, denoted ImageHear, that can be used as a benchmark for evaluating future image-to-audio models, together with an ablation study assessing the impact of each method component on overall performance.

Full-band General Audio Synthesis with Score-based Diffusion

This work proposes DAG, a diffusion-based generative model for general audio synthesis that deals with full-band signals end-to-end in the waveform domain; the authors believe DAG is capable of accommodating different conditioning schemas while providing good-quality synthesis.

Speaking Style Conversion With Discrete Self-Supervised Units

This study introduces a method for converting not only the timbre but also prosodic information (i.e., rhythm and pitch changes) to those of the target speaker, through a pretrained, self-supervised model that encodes speech into discrete units.

AERO: Audio Super Resolution in the Spectral Domain

This work presents AERO, an audio super-resolution model that processes speech and music signals in the spectral domain; it is based on an encoder-decoder architecture with U-Net-like skip connections and is optimized using both time- and frequency-domain loss functions.

Regeneration Learning: A Learning Paradigm for Data Generation

Regeneration learning can be a widely-used paradigm for data generation (e.g., text generation, speech recognition, speech synthesis, music composition, image generation, and video generation) and can provide valuable insights into developing data generation methods.

Towards Practical Plug-and-Play Diffusion Models

This paper proposes a novel strategy that leverages multiple experts, where each expert is specialized in a particular noise range and guides the reverse process at its corresponding timesteps, and presents a practical guidance framework that leverages parameter-efficient tuning and data-free knowledge transfer.

Visual onoma-to-wave: environmental sound synthesis from visual onomatopoeias and sound-source images

We propose a method for synthesizing environmental sounds from visually represented onomatopoeias and sound sources. An onomatopoeia is a word that imitates a sound structure, i.e., the text…

A Survey on Artificial Intelligence for Music Generation: Agents, Domains and Perspectives

To understand how AI models and algorithms generate music and the potential applications that might appear in the future, this paper explores, analyzes, and describes the agents that take part in the music generation process: the datasets, models, interfaces, users, and the generated music.

AudioLDM: Text-to-Audio Generation with Latent Diffusion Models

AudioLDM, a TTA system that is built on a latent space to learn the continuous audio representations from contrastive language-audio pretraining (CLAP) latents, achieves state-of-the-art TTA performance measured by both objective and subjective metrics.

AudioLM: a Language Modeling Approach to Audio Generation

The proposed hybrid tokenization scheme leverages the discretized activations of a masked language model pre-trained on audio to capture long-term structure and the discrete codes produced by a neural audio codec to achieve high-quality synthesis.

AudioCaps: Generating Captions for Audios in The Wild

A large-scale dataset of 46K audio clips with human-written text pairs collected via crowdsourcing on the AudioSet dataset is contributed and two novel components that help improve audio captioning performance are proposed: the top-down multi-scale encoder and aligned semantic attention.

Diffsound: Discrete Diffusion Model for Text-to-sound Generation

This study investigates generating sound conditioned on a text prompt and proposes Diffsound, a novel text-to-sound generation framework consisting of a text encoder, a Vector Quantized Variational Autoencoder (VQ-VAE), a decoder, and a vocoder, to overcome the shortcomings introduced by AR decoders.

MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis

The model is non-autoregressive, fully convolutional, with significantly fewer parameters than competing models and generalizes to unseen speakers for mel-spectrogram inversion, and suggests a set of guidelines to design general purpose discriminators and generators for conditional sequence synthesis tasks.

Clotho: an Audio Captioning Dataset

Clotho, a dataset for audio captioning consisting of 4981 audio samples of 15 to 30 seconds duration and 24,905 captions of eight to 20 words in length, is presented, together with a baseline method to provide initial results.

Scaling Autoregressive Models for Content-Rich Text-to-Image Generation

The Pathways Autoregressive Text-to-Image (Parti) model is presented, which generates high-fidelity photorealistic images and supports content-rich synthesis involving complex compositions and world knowledge and explores and highlights limitations of the models.

SoundStream: An End-to-End Neural Audio Codec

A novel neural audio codec that can efficiently compress speech, music and general audio at bitrates normally targeted by speech-tailored codecs and perform joint compression and enhancement either at the encoder or at the decoder side with no additional latency is presented.

Masked Autoencoders that Listen

Audio-MAE is a simple extension of image-based Masked Autoencoders to self-supervised representation learning from audio spectrograms, outperforming other recent models that use external supervised pre-training.

HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis

It is demonstrated that modeling the periodic patterns of audio is crucial for enhancing sample quality, and the generality of HiFi-GAN is shown on mel-spectrogram inversion of unseen speakers and on end-to-end speech synthesis.

Speech Resynthesis from Discrete Disentangled Self-Supervised Representations

To generate disentangled representations, low-bitrate representations are extracted for speech content, prosodic information, and speaker identity, enabling speech to be synthesized in a controllable manner from self-supervised discrete representations.