Conditional End-to-End Audio Transforms

Albert Haque, Michelle Guo, Prateek Verma
We present an end-to-end method for transforming audio from one style to another. […] Architecturally, our method is a fully differentiable sequence-to-sequence model based on convolutional and hierarchical recurrent neural networks. It is designed to capture long-term acoustic dependencies, requires minimal post-processing, and produces realistic audio transforms. Ablation studies confirm that our model can separate speaker and instrument properties from acoustic content at different receptive…

SING: Symbol-to-Instrument Neural Generator
This work presents a lightweight neural audio synthesizer trained end-to-end to generate notes from nearly 1000 instruments with a single decoder, thanks to a new loss function that minimizes the distances between the log spectrograms of the generated and target waveforms.
GoodBye WaveNet - A Language Model for Raw Audio with Context of 1/2 Million Samples
This work proposes a generative auto-regressive architecture that can model audio waveforms over quite a large context, greater than 500,000 samples, on a standard dataset for modeling long-term structure.
Direct speech-to-speech translation with a sequence-to-sequence model
We present an attention-based sequence-to-sequence neural network which can directly translate speech from one language into speech in another language, without relying on an intermediate text
Speech-To-Singing Conversion in an Encoder-Decoder Framework
This paper proposes an encoder–decoder framework that enables singing that preserves the linguistic content and timbre of the speaker while adhering to the target melody in time-frequency representations of speech and a target melody contour.
Blow: a single-scale hyperconditioned flow for non-parallel raw-audio voice conversion
This paper proposes Blow, a single-scale normalizing flow using hypernetwork conditioning to perform many-to-many voice conversion between raw audio, and shows that Blow compares favorably to existing flow-based architectures and other competitive baselines, obtaining equal or better performance in both objective and subjective evaluations.
Parrotron: An End-to-End Speech-to-Speech Conversion Model and its Applications to Hearing-Impaired Speech and Speech Separation
It is demonstrated that this model can be trained to normalize speech from any speaker regardless of accent, prosody, and background noise, into the voice of a single canonical target speaker with a fixed accent and consistent articulation and prosody.
A Framework for Contrastive and Generative Learning of Audio Representations
This paper presents a self-supervised framework for contrastive learning of audio representations that needs no ground-truth labels, and explores generative models based on state-of-the-art Transformer architectures for learning latent spaces for audio signals.
A Generative Model for Raw Audio Using Transformer Architectures
Prateek Verma, C. Chafe. 24th International Conference on Digital Audio Effects (DAFx), 2021.
This paper proposes a novel way of doing audio synthesis at the waveform level using Transformer architectures, and shows how causal transformer generative models can be used for raw waveform synthesis.
Large Scale Audio Understanding without Transformers/ Convolutions/ BERTs/ Mixers/ Attention/ RNNs or
This work shows how to surpass traditional convolutional neural network architectures and come strikingly close to outperforming powerful Transformer architectures, which would pave the way for exciting advancements in representation learning without massive, end-to-end neural architectures.
ConvS2S-VC: Fully Convolutional Sequence-to-Sequence Voice Conversion
A voice conversion (VC) method using sequence-to-sequence (seq2seq or S2S) learning, which flexibly converts not only the voice characteristics but also the pitch contour and duration of input speech.
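The SING entry above mentions a loss that minimizes the distance between the log spectrograms of the generated and target waveforms. A minimal sketch of that idea follows, in plain Python. The frame size, hop size, Hann window, the offset `eps`, the mean L1 reduction, and the function names `stft_power` and `log_spectral_l1` are all assumptions made for illustration, not details taken from the listing or the authors' implementation.

```python
import cmath
import math


def stft_power(signal, frame_size=64, hop=32):
    """Power spectrogram from a naive DFT over Hann-windowed frames.

    frame_size and hop are illustrative defaults (assumptions).
    """
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_size)
              for n in range(frame_size)]
    frames = []
    for start in range(0, len(signal) - frame_size + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(frame_size)]
        bins = []
        # Keep only the non-negative frequency bins of a real signal.
        for k in range(frame_size // 2 + 1):
            acc = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_size)
                      for n in range(frame_size))
            bins.append(abs(acc) ** 2)
        frames.append(bins)
    return frames


def log_spectral_l1(generated, target, eps=1.0):
    """Mean absolute difference between log power spectrograms.

    eps keeps the logarithm finite for silent bins (assumed value).
    """
    spec_g = stft_power(generated)
    spec_t = stft_power(target)
    total, count = 0.0, 0
    for frame_g, frame_t in zip(spec_g, spec_t):
        for power_g, power_t in zip(frame_g, frame_t):
            total += abs(math.log(eps + power_g) - math.log(eps + power_t))
            count += 1
    if count == 0:
        raise ValueError("signals are shorter than one frame")
    return total / count
```

Comparing in the log-spectrogram domain means two waveforms with the same spectral content but different phase score as close, which a sample-by-sample waveform loss would not do. A practical implementation would likely use an FFT from a numerical library rather than this naive DFT.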


Tacotron: Towards End-to-End Speech Synthesis
Tacotron is presented, an end-to-end generative text-to-speech model that synthesizes speech directly from characters. It achieves a 3.82 subjective 5-scale mean opinion score on US English, outperforming a production parametric system in terms of naturalness.
WaveNet: A Generative Model for Raw Audio
WaveNet, a deep neural network for generating raw audio waveforms, is introduced; it is shown that it can be efficiently trained on data with tens of thousands of samples per second of audio, and can be employed as a discriminative model, returning promising results for phoneme recognition.
Char2Wav: End-to-End Speech Synthesis
Char2Wav is an end-to-end model for speech synthesis that learns to produce audio directly from text and is a bidirectional recurrent neural network with attention that produces vocoder acoustic features.
Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis
"Global style tokens" (GSTs), a bank of embeddings that are jointly trained within Tacotron, a state-of-the-art end-to-end speech synthesis system, learn to factorize noise and speaker identity, providing a path towards highly scalable but robust speech synthesis.
First Step Towards End-to-End Parametric TTS Synthesis: Generating Spectral Parameters with Neural Attention
This paper attempts to bypass limitations using a novel end-to-end parametric TTS synthesis framework, i.e. the text analysis and acoustic modeling are integrated together employing an attention-based recurrent neural network.
Deep Voice: Real-time Neural Text-to-Speech
Deep Voice lays the groundwork for truly end-to-end neural speech synthesis and shows that inference with the system can be performed faster than real time and describes optimized WaveNet inference kernels on both CPU and GPU that achieve up to 400x speedups over existing implementations.
Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders
A powerful new WaveNet-style autoencoder model is detailed that conditions an autoregressive decoder on temporal codes learned from the raw audio waveform, and NSynth, a large-scale and high-quality dataset of musical notes that is an order of magnitude larger than comparable public datasets is introduced.
Listen, attend and spell: A neural network for large vocabulary conversational speech recognition
We present Listen, Attend and Spell (LAS), a neural speech recognizer that transcribes speech utterances directly to characters without pronunciation models, HMMs or other components of traditional
A Universal Music Translation Network
This method is based on a multi-domain wavenet autoencoder, with a shared encoder and a disentangled latent space that is trained end-to-end on waveforms, allowing it to translate even from musical domains that were not seen during training.
Sequence-to-Sequence Voice Conversion with Similarity Metric Learned Using Generative Adversarial Networks
The proposed SVC framework uses a similarity metric implicitly derived from a generative adversarial network, enabling the measurement of the distance in the high-level abstract space to mitigate the oversmoothing problem caused in the low-level data space.