SEANet: A Multi-modal Speech Enhancement Network

Marco Tagliasacchi, Yunpeng Li, Karolis Misiunas, Dominik Roblek
We explore the possibility of leveraging accelerometer data to perform speech enhancement in very noisy conditions. Although the user's speech can be only partially reconstructed from the accelerometer, the latter provides a strong conditioning signal that is not influenced by noise sources in the environment. Based on this observation, we feed a multi-modal input to SEANet (Sound EnhAncement Network), a wave-to-wave fully convolutional model, which adopts a combination of feature losses…
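As a rough illustration of what a "multi-modal input" to a wave-to-wave convolutional model could look like, the sketch below stacks a noisy audio waveform and a time-aligned accelerometer waveform as two input channels and passes them through a single 1-D convolution. This is only a sketch under stated assumptions: the abstract does not specify the layer configuration, so the function names (`stack_modalities`, `conv1d_same`), the kernel shape, and the use of one untrained layer are all hypothetical and stand in for the real network.

```python
import numpy as np


def stack_modalities(noisy_audio, accel):
    """Stack two time-aligned 1-D waveforms into a (2, samples) array.

    Assumption: both signals share the same sample rate and length.
    """
    noisy_audio = np.asarray(noisy_audio, dtype=np.float32)
    accel = np.asarray(accel, dtype=np.float32)
    if noisy_audio.ndim != 1 or noisy_audio.shape != accel.shape:
        raise ValueError("expected two 1-D signals of equal length")
    return np.stack([noisy_audio, accel], axis=0)


def conv1d_same(x, kernels):
    """Toy 1-D convolution layer with 'same' zero padding.

    x:       (in_channels, samples)
    kernels: (out_channels, in_channels, kernel_size), kernel_size odd
    returns: (out_channels, samples)
    """
    out_ch, in_ch, k = kernels.shape
    if k % 2 == 0 or in_ch != x.shape[0]:
        raise ValueError("kernel_size must be odd and in_channels must match x")
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad)))
    out = np.zeros((out_ch, x.shape[1]), dtype=np.float32)
    for o in range(out_ch):
        for i in range(in_ch):
            # Cross-correlation, as in most deep learning "convolution" layers.
            out[o] += np.correlate(xp[i], kernels[o, i], mode="valid")
    return out


# Hypothetical usage: random signals stand in for real recordings.
rng = np.random.default_rng(0)
samples = 16000
x = stack_modalities(rng.standard_normal(samples), rng.standard_normal(samples))
kernels = (0.1 * rng.standard_normal((1, 2, 7))).astype(np.float32)
y = conv1d_same(x, kernels)  # one output channel, same length as the input
```

Because the two modalities enter as channels of the same tensor, every convolutional filter can combine the noisy audio with the noise-insensitive accelerometer signal at each time step, which is one plausible reading of the conditioning described above.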

HiFi++: a Unified Framework for Neural Vocoding, Bandwidth Extension and Speech Enhancement
This paper proposes HiFi++, a novel general framework for neural vocoding, bandwidth extension, and speech enhancement, and shows that with an improved generator architecture and simplified multi-discriminator training, HiFi++ performs on par with the state of the art in these tasks while using significantly less memory and computation.
Real-Time Speech Frequency Bandwidth Extension
A lightweight model for frequency bandwidth extension of speech signals, increasing the sampling frequency from 8kHz to 16kHz while restoring the high frequency content to a level almost indistinguishable from the 16kHz ground truth, achieving an architectural latency of 16ms.
BEHM-GAN: Bandwidth Extension of Historical Music using Generative Adversarial Networks
The results of a formal blind listening test show that BEHM-GAN increases the perceptual sound quality in early-20th-century gramophone recordings and represents a relevant step toward data-driven music restoration in real-world scenarios.
FFC-SE: Fast Fourier Convolution for Speech Enhancement
This work designs neural network architectures that adapt FFC for speech enhancement, hypothesizes that a large receptive field allows these networks to produce more coherent phases than vanilla convolutional models, and validates this hypothesis experimentally.
Let’s Grab a Drink: Teacher-Student Learning for Fluid Intake Monitoring using Smart Earphones
A voice pickup microphone captures body vibrations during fluid consumption directly through skin contact and body conduction, yielding stronger signals that are immune to ambient environmental noise. The work also addresses the challenge of needing large-scale datasets to train machine learning (ML) models.
SpeechPainter: Text-conditioned Speech Inpainting
SpeechPainter is a model that fills gaps of up to one second in speech samples by leveraging an auxiliary textual input; it outperforms baselines constructed using adaptive TTS, as judged by human raters in side-by-side preference and MOS tests.
Text-Driven Separation of Arbitrary Sounds
This work proposes a method of separating a desired sound source from a single-channel mixture, based on either a textual description or a short audio sample of the target source, by combining two distinct models that are agnostic to the conditioning modality.
Cross-Attention Conformer for Context Modeling in Speech Enhancement for ASR
Noise context, i.e., a short noise-only audio segment preceding an utterance, can be used to build a speech enhancement feature frontend with cross-attention conformer layers, improving the noise robustness of automatic speech recognition.
Micaugment: One-Shot Microphone Style Transfer
It is shown that the proposed method to perform one-shot microphone style transfer can successfully apply the style transfer to real audio and that it significantly increases model robustness when used as data augmentation in the downstream tasks.
One-Shot Conditional Audio Filtering of Arbitrary Sounds
We consider the problem of separating a particular sound source from a single-channel mixture, based on only a short sample of the target source (from the same recording). Using SoundFilter, a


SEGAN: Speech Enhancement Generative Adversarial Network
This work proposes the use of generative adversarial networks for speech enhancement. The model operates at the waveform level, is trained end-to-end, and incorporates 28 speakers and 40 different noise conditions, such that model parameters are shared across them.
Exploring Speech Enhancement with Generative Adversarial Networks for Robust Speech Recognition
While GAN enhancement improves the performance of a clean-trained ASR system on noisy speech, it falls short of the performance achieved by conventional multi-style training (MTR), so a detailed study is conducted to measure the effectiveness of GANs in enhancing speech contaminated by both additive and reverberant noise.
An End-to-End Multimodal Voice Activity Detection Using WaveNet Encoder and Residual Networks
  • I. Ariav, I. Cohen
  • Computer Science
    IEEE Journal of Selected Topics in Signal Processing
  • 2019
Experimental results demonstrate the improved performance of the proposed end-to-end multimodal architecture compared to unimodal variants for VAD.
MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis
The model is non-autoregressive and fully convolutional, has significantly fewer parameters than competing models, and generalizes to unseen speakers for mel-spectrogram inversion. The work also suggests a set of guidelines for designing general-purpose discriminators and generators for conditional sequence synthesis tasks.
Looking to listen at the cocktail party
A deep network-based model that incorporates both visual and auditory signals to isolate a single speech signal from a mixture of sounds such as other speakers and background noise, showing a clear advantage over state-of-the-art audio-only speech separation in cases of mixed speech.
Universal Sound Separation
A dataset of mixtures containing arbitrary sounds is developed, and the best methods produce an improvement in scale-invariant signal-to-distortion ratio of over 13 dB for speech/non-speech separation and close to 10 dB for universal sound separation.
Speech feature denoising and dereverberation via deep autoencoders for noisy reverberant speech recognition
A deep denoising autoencoder (DDA) framework that can produce robust speech features for noisy reverberant speech recognition and shows a 16-25% absolute improvement in recognition accuracy under various SNRs.
TasNet: Time-Domain Audio Separation Network for Real-Time, Single-Channel Speech Separation
  • Yi Luo, N. Mesgarani
  • Computer Science
    2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2018
Time-domain Audio Separation Network (TasNet) is proposed, which outperforms the current state-of-the-art causal and noncausal speech separation algorithms, reduces the computational cost of speech separation, and significantly reduces the minimum required latency of the output.
Wavesplit: End-to-End Speech Separation by Speaker Clustering
Wavesplit redefines the state of the art on clean mixtures of 2 or 3 speakers, as well as in noisy and reverberated settings, and sets a new benchmark on the recent LibriMix dataset.