SEANet: A Multi-modal Speech Enhancement Network

Marco Tagliasacchi, Yunpeng Li, Karolis Misiunas, Dominik Roblek
We explore the possibility of leveraging accelerometer data to perform speech enhancement in very noisy conditions. Although the user's speech can only be partially reconstructed from the accelerometer, the latter provides a strong conditioning signal that is not influenced by noise sources in the environment. Based on this observation, we feed a multi-modal input to SEANet (Sound EnhAncement Network), a wave-to-wave fully convolutional model, which adopts a combination of feature losses…
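The abstract describes feeding the noisy audio and the accelerometer signal jointly to a wave-to-wave convolutional model. A minimal sketch of how such a two-channel multi-modal input could be assembled, assuming both streams are already time-aligned and at the same sample rate (the function and shapes are illustrative, not the paper's code):

```python
import numpy as np

def make_multimodal_input(noisy_audio, accel, frame_len=1024):
    # Stack the noisy audio waveform and the accelerometer signal (partially
    # informative about speech, but robust to environmental noise) as two
    # input channels, truncated to a whole number of frames.
    n = min(len(noisy_audio), len(accel)) // frame_len * frame_len
    return np.stack([noisy_audio[:n], accel[:n]], axis=0)  # shape: (2, n)

audio = np.random.randn(5000)
accel = np.random.randn(5000)
x = make_multimodal_input(audio, accel)  # x.shape == (2, 4096)
```

A convolutional encoder would then consume `x` as a 2-channel 1-D signal, letting the network learn how much to trust each modality per time step.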

WaveFit: An Iterative and Non-autoregressive Neural Vocoder based on Fixed-Point Iteration

WaveFit iteratively denoises an input signal, training a deep neural network to minimize an adversarial loss computed from the intermediate outputs at all iterations; subjective evaluations showed no statistically significant difference in naturalness between natural human speech and speech synthesized by WaveFit with five iterations.

HiFi++: a Unified Framework for Neural Vocoding, Bandwidth Extension and Speech Enhancement

This paper proposes HiFi++, a novel general framework for neural vocoding, bandwidth extension, and speech enhancement, and shows that with the improved generator architecture and simplified multi-discriminator training, HiFi++ performs on par with the state of the art in these tasks while using significantly less memory and computation.

HiFi++: a Unified Framework for Bandwidth Extension and Speech Enhancement

It is shown that with the improved generator architecture and simplified multi-discriminator training, HiFi++ performs better than or on par with the state of the art in these tasks while using significantly fewer computational resources.

Real-Time Speech Frequency Bandwidth Extension

A lightweight model for frequency bandwidth extension of speech signals, increasing the sampling frequency from 8kHz to 16kHz while restoring the high frequency content to a level almost indistinguishable from the 16kHz ground truth, achieving an architectural latency of 16ms.

High Fidelity Neural Audio Compression

A novel loss balancer mechanism is introduced to stabilize training: the weight of a loss now defines the fraction of the overall gradient it should represent, decoupling the choice of this hyper-parameter from the typical scale of the loss.
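The balancer idea above can be sketched concretely: normalize each loss's gradient to unit norm before mixing, so that a loss's weight fixes its share of the combined update regardless of its natural scale. This is a simplified illustration under that assumption (the paper's actual implementation differs, e.g. it uses running averages of gradient norms):

```python
import numpy as np

def balance_gradients(grads, weights, eps=1e-12):
    # Each gradient is renormalized to unit norm, then mixed with weights
    # that sum to 1 -- so weight w_i is the fraction of the combined
    # gradient contributed by loss i, independent of that loss's scale.
    wsum = float(sum(weights))
    total = np.zeros_like(grads[0], dtype=float)
    for g, w in zip(grads, weights):
        total += (w / wsum) * g / (np.linalg.norm(g) + eps)
    return total

# Two losses with wildly different gradient scales but equal weights:
g_small = np.array([1.0, 0.0])
g_large = np.array([0.0, 100.0])
balanced = balance_gradients([g_small, g_large], [1.0, 1.0])
# Each loss contributes equally despite the 100x scale gap.
```

Without the normalization, the second loss would dominate the update by two orders of magnitude; with it, retuning one loss term's scale no longer forces retuning every other weight.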

DCTCN: Deep Complex Temporal Convolutional Network for Long Time Speech Enhancement

The proposed DCTCN is more effective at modeling long time series, SKNet extracts and restores finer-grained features, and the model obtains very competitive results on the TIMIT and VoiceBank+DEMAND datasets.

AudioLM: a Language Modeling Approach to Audio Generation

The proposed hybrid tokenization scheme leverages the discretized activations of a masked language model pre-trained on audio to capture long-term structure and the discrete codes produced by a neural audio codec to achieve high-quality synthesis.

DelightfulTTS 2: End-to-End Speech Synthesis with Adversarial Vector-Quantized Auto-Encoders

A new codec network based on vector-quantized auto-encoders with adversarial training (VQ-GAN) to extract intermediate frame-level speech representations and reconstruct speech waveform is proposed.
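At the core of a VQ-GAN codec like the one summarized above is a vector-quantization step: each frame-level representation is snapped to its nearest codebook entry, and only the integer indices are kept. A hedged, self-contained sketch of that lookup (names and shapes are mine, not from the paper):

```python
import numpy as np

def vector_quantize(frames, codebook):
    # Nearest-neighbour lookup: replace each frame-level vector with its
    # closest codebook entry (squared Euclidean distance); the integer
    # index stream is what gets stored or transmitted.
    dists = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = dists.argmin(axis=1)
    return codebook[idx], idx

codebook = np.array([[0.0, 0.0], [1.0, 1.0]])
frames = np.array([[0.1, -0.1], [0.9, 1.2]])
quantized, idx = vector_quantize(frames, codebook)  # idx: [0, 1]
```

Adversarial training then pushes the decoder to reconstruct natural-sounding waveforms from these discretized representations.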

mmSpy: Spying Phone Calls using mmWave Radars

This paper presents mmSpy, a system that solves a number of challenges related to the non-availability of large-scale radar datasets, systematic correction of various noise sources, and domain-adaptation problems in harvesting training data, and demonstrates the feasibility of eavesdropping on phone calls remotely.

Multi-instrument Music Synthesis with Spectrogram Diffusion

This work compares training the decoder as an autoregressive model and as a Denoising Diffusion Probabilistic Model (DDPM) and finds that the DDPM approach is superior both qualitatively and as measured by audio reconstruction and Fréchet distance metrics.



SEGAN: Speech Enhancement Generative Adversarial Network

This work proposes the use of generative adversarial networks for speech enhancement; it operates at the waveform level, trains the model end-to-end, and incorporates 28 speakers and 40 different noise conditions into the same model, so that model parameters are shared across them.

Exploring Speech Enhancement with Generative Adversarial Networks for Robust Speech Recognition

While GAN-based enhancement improves the performance of a clean-trained ASR system on noisy speech, it falls short of the performance achieved by conventional multi-style training (MTR); a detailed study is conducted to measure the effectiveness of GANs in enhancing speech contaminated by both additive and reverberant noise.

An End-to-End Multimodal Voice Activity Detection Using WaveNet Encoder and Residual Networks

  • I. Ariav, I. Cohen
  • Computer Science
    IEEE Journal of Selected Topics in Signal Processing
  • 2019
Experimental results demonstrate the improved performance of the proposed end-to-end multimodal architecture compared to unimodal variants for VAD.

MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis

The model is non-autoregressive, fully convolutional, with significantly fewer parameters than competing models and generalizes to unseen speakers for mel-spectrogram inversion, and suggests a set of guidelines to design general purpose discriminators and generators for conditional sequence synthesis tasks.

Looking to listen at the cocktail party

A deep network-based model that incorporates both visual and auditory signals to isolate a single speech signal from a mixture of sounds such as other speakers and background noise, showing a clear advantage over state-of-the-art audio-only speech separation in cases of mixed speech.

Universal Sound Separation

A dataset of mixtures containing arbitrary sounds is developed, and the best methods produce an improvement in scale-invariant signal-to-distortion ratio of over 13 dB for speech/non-speech separation and close to 10 dB for universal sound separation.
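The improvements above are reported in scale-invariant signal-to-distortion ratio (SI-SDR), which factors out gain mismatches before comparing target and residual energy. A minimal reference implementation of the metric itself (the function name and test signals are illustrative):

```python
import numpy as np

def si_sdr(estimate, reference, eps=1e-12):
    # Scale-invariant SDR in dB: project the (zero-mean) estimate onto the
    # reference to remove any gain difference, then compare the energy of
    # the projected target to the energy of the residual.
    ref = reference - reference.mean()
    est = estimate - estimate.mean()
    alpha = np.dot(est, ref) / (np.dot(ref, ref) + eps)
    target = alpha * ref
    residual = est - target
    return 10.0 * np.log10((np.dot(target, target) + eps)
                           / (np.dot(residual, residual) + eps))

rng = np.random.default_rng(0)
ref = rng.standard_normal(16000)
est = ref + 0.1 * rng.standard_normal(16000)   # mildly corrupted estimate
score = si_sdr(est, ref)                       # higher is better, ~20 dB here
score_scaled = si_sdr(3.0 * est, ref)          # unchanged: scale-invariant
```

The "over 13 dB improvement" cited above means the separated output's SI-SDR exceeds the unprocessed mixture's by that margin.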

Speech feature denoising and dereverberation via deep autoencoders for noisy reverberant speech recognition

A deep denoising autoencoder (DDA) framework that can produce robust speech features for noisy reverberant speech recognition, showing a 16-25% absolute improvement in recognition accuracy under various SNRs.

TaSNet: Time-Domain Audio Separation Network for Real-Time, Single-Channel Speech Separation

  • Yi Luo, N. Mesgarani
  • Computer Science
    2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2018
Time-domain Audio Separation Network (TasNet) is proposed, which outperforms the current state-of-the-art causal and noncausal speech separation algorithms, reduces the computational cost of speech separation, and significantly reduces the minimum required latency of the output.

Wavesplit: End-to-End Speech Separation by Speaker Clustering

Wavesplit redefines the state of the art on clean mixtures of 2 or 3 speakers, as well as in noisy and reverberated settings, and sets a new benchmark on the recent LibriMix dataset.