Conditional Sound Generation Using Neural Discrete Time-Frequency Representation Learning

  • Xubo Liu, Turab Iqbal, Jinzheng Zhao, Qiushi Huang, Mark D. Plumbley, Wenwu Wang
  • Published 21 July 2021
  • Computer Science
  • 2021 IEEE 31st International Workshop on Machine Learning for Signal Processing (MLSP)
Deep generative models have recently achieved impressive performance in speech and music synthesis. However, compared to the generation of those domain-specific sounds, generating general sounds (such as sirens and gunshots) has received less attention, despite their wide applications. In previous work, the SampleRNN method was considered for sound generation in the time domain. However, SampleRNN is potentially limited in capturing long-range dependencies within sounds as it only back-propagates…

Full-band General Audio Synthesis with Score-based Diffusion

This work proposes DAG, a diffusion-based generative model for general audio synthesis that deals with full-band signals end-to-end in the waveform domain. The authors argue that DAG is capable enough to accommodate different conditioning schemas while providing good-quality synthesis.

Diffsound: Discrete Diffusion Model for Text-to-sound Generation

This study investigates generating sound conditioned on a text prompt and proposes Diffsound, a novel text-to-sound generation framework that consists of a text encoder, a Vector Quantized Variational Autoencoder (VQ-VAE), a decoder, and a vocoder, designed to overcome the shortcomings introduced by autoregressive (AR) decoders.

Visual onoma-to-wave: environmental sound synthesis from visual onomatopoeias and sound-source images

We propose a method for synthesizing environmental sounds from visually represented onomatopoeias and sound sources. An onomatopoeia is a word that imitates a sound structure, i.e., the text

How Should We Evaluate Synthesized Environmental Sounds

Although several methods of environmental sound synthesis have been proposed, there has been no discussion on how synthesized environmental sounds should be evaluated. Only either subjective or

Automated audio captioning: an overview of recent progress and new challenges

A comprehensive review of the published contributions in automated audio captioning, from a variety of existing approaches to evaluation metrics and datasets is presented.

VoiceFixer: A Unified Framework for High-Fidelity Speech Restoration

Both objective and subjective evaluations show that VoiceFixer is effective on severely degraded speech, such as real-world historical speech recordings. The framework includes a synthesis stage that generates the waveform using a neural vocoder.

Neural Vocoder is All You Need for Speech Super-resolution

This paper proposes a neural-vocoder-based speech super-resolution method that can handle a variety of input resolutions and upsampling ratios, and demonstrates that prior knowledge in the pre-trained vocoder is crucial for speech SR by performing mel-bandwidth extension with a simple replication-padding method.

Deep Neural Decision Forest for Acoustic Scene Classification

This paper proposes a novel approach for ASC using deep neural decision forest (DNDF), which combines a fixed number of convolutional layers and a decision forest as the final classifier and demonstrates that this method improves the ASC performance in terms of classification accuracy and shows competitive performance as compared with state-of-the-art baselines.

Automated identification of chicken distress vocalisations using deep learning models

A novel light-VGG11 was developed to automatically identify chicken distress calls using recordings collected on intensive chicken farms. Different data augmentation techniques were also investigated and found to improve distress call detection by up to 1.52%.

Taming Visually Guided Sound Generation

This work proposes a single model capable of generating visually relevant, high-fidelity sounds prompted with a set of frames from open-domain videos; on a single GPU, generation takes less time than playing the sound back.



Adam: A Method for Stochastic Optimization

This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.

Onoma-to-wave: Environmental sound synthesis from onomatopoeic words

This work proposes a method based on a sequence-to-sequence framework for synthesizing environmental sounds from onomatopoeic words. Using sound event labels in addition to onomatopoeic words makes it possible to capture each sound event’s feature depending on the input sound event label.

Multi-Scale Residual Convolutional Encoder Decoder with Bidirectional Long Short-Term Memory for Single Channel Speech Enhancement

McbNet, a multi-scale convolutional bidirectional long short-term memory (BLSTM) recurrent neural network for end-to-end single-channel speech enhancement, offers consistent improvement over state-of-the-art methods on public datasets.

HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis

It is demonstrated that modeling the periodic patterns of audio is crucial for enhancing sample quality, and HiFi-GAN is shown to generalize to the mel-spectrogram inversion of unseen speakers and to end-to-end speech synthesis.

Sound Event Detection of Weakly Labelled Data With CNN-Transformer and Automatic Threshold Optimization

A convolutional neural network transformer (CNN-Transformer) is proposed for audio tagging and SED, and it is shown that the CNN-Transformer performs similarly to a convolutional recurrent neural network (CRNN).

MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis

The model is non-autoregressive and fully convolutional, has significantly fewer parameters than competing models, and generalizes to unseen speakers for mel-spectrogram inversion. The work also suggests a set of guidelines for designing general-purpose discriminators and generators for conditional sequence synthesis tasks.

Overview of Tasks and Investigation of Subjective Evaluation Methods in Environmental Sound Synthesis and Conversion

This paper reports on environmental sound synthesis using sound event labels, in which it focuses on the current performance of statistical environmental sound synthesis and investigates how to conduct subjective experiments on environmental sound synthesis.

MelNet: A Generative Model for Audio in the Frequency Domain

This work designs a model capable of generating high-fidelity audio samples which capture structure at timescales that time-domain models have yet to achieve, and applies it to a variety of audio generation tasks, showing improvements over previous approaches in both density estimates and human judgments.

Acoustic Scene Generation with Conditional SampleRNN

This paper proposes to use a conditional SampleRNN model to generate acoustic scenes conditioned on the input classes and proposes objective criteria to evaluate the quality and diversity of the generated samples based on classification accuracy.

Introduction to sound scene and event analysis

This chapter introduces the basic concepts, research problems, and engineering challenges in computational environmental sound analysis, and motivates the field by briefly describing various applications where the methods can be used.