PoCoNet: Better Speech Enhancement with Frequency-Positional Embeddings, Semi-Supervised Conversational Data, and Biased Loss

@inproceedings{Isik2020PoCoNetBS,
  title={PoCoNet: Better Speech Enhancement with Frequency-Positional Embeddings, Semi-Supervised Conversational Data, and Biased Loss},
  author={Umut Isik and Ritwik Giri and Neerad Phansalkar and Jean-Marc Valin and Karim Helwani and Arvindh Krishnaswamy},
  booktitle={INTERSPEECH},
  year={2020}
}
Neural network applications generally benefit from larger-sized models, but for current speech enhancement models, larger-scale networks often suffer from decreased robustness to the variety of real-world use cases beyond what is encountered in training data. We introduce several innovations that lead to better large neural networks for speech enhancement. The novel PoCoNet architecture is a convolutional neural network that, with the use of frequency-positional embeddings, is able to more…
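As a hedged illustration of the frequency-positional embedding idea in the abstract, the sketch below concatenates fixed cosine position channels along the frequency axis of a spectrogram before a 2-D convolution, so early layers can learn frequency-dependent features. The function name, embedding form, and shapes are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch (assumed shapes and embedding form, not the paper's code).
import math
import torch
import torch.nn as nn

def add_freq_positional_embedding(spec: torch.Tensor, n_embed: int = 10) -> torch.Tensor:
    """spec: (batch, channels, freq, time). Appends n_embed channels
    encoding each bin's position along the frequency axis."""
    b, _, f, t = spec.shape
    pos = torch.linspace(0.0, 1.0, f, device=spec.device)     # bin position in [0, 1]
    k = torch.arange(1, n_embed + 1, device=spec.device)      # embedding frequencies
    emb = torch.cos(math.pi * k[:, None] * pos[None, :])      # (n_embed, freq)
    emb = emb[None, :, :, None].expand(b, n_embed, f, t)      # tile over batch and time
    return torch.cat([spec, emb], dim=1)

# Usage: the conv stem sees both the spectrogram and where each bin lies.
conv = nn.Conv2d(in_channels=1 + 10, out_channels=32, kernel_size=3, padding=1)
x = torch.randn(4, 1, 257, 100)             # (batch, 1, freq_bins, frames)
y = conv(add_freq_positional_embedding(x))  # (4, 32, 257, 100)
```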
Citations

Semi-Supervised Singing Voice Separation With Noisy Self-Training
TLDR
Empirical results show that the proposed self-training scheme, along with data augmentation methods, effectively leverages the large unlabeled corpus and obtains superior performance compared to supervised methods.
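The self-training loop summarized above can be sketched schematically; `train`, `augment`, and the data handles below are hypothetical placeholders, not the authors' code.

```python
# Schematic sketch of noisy self-training (hypothetical helpers).
def noisy_self_training(labeled, unlabeled, train, augment):
    # 1. Train a teacher separator on the labeled (mixture, target) pairs.
    teacher = train(labeled)
    # 2. Pseudo-label the large unlabeled corpus with the teacher.
    pseudo = [(mix, teacher(mix)) for mix in unlabeled]
    # 3. Retrain a student on labeled + pseudo-labeled data with augmentation.
    student = train([(augment(mix), tgt) for mix, tgt in labeled + pseudo])
    return student
```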
Human Listening and Live Captioning: Multi-Task Training for Speech Enhancement.
With the surge of online meetings, it has become more critical than ever to provide high-quality speech audio and live captioning under various noise conditions. However, most monaural speech…
Personalized PercepNet: Real-time, Low-complexity Target Voice Separation and Enhancement
The presence of multiple talkers in the surrounding environment poses a difficult challenge for real-time speech communication systems considering the constraints on network size and complexity. In…
Training Speech Enhancement Systems with Noisy Speech Datasets
TLDR
This paper proposes several modifications of the loss functions that make them robust against noisy speech targets, along with a noise augmentation scheme for mixture-invariant training (MixIT) that allows it to be used in such scenarios too.
HIFI-GAN-2: STUDIO-QUALITY SPEECH ENHANCEMENT VIA GENERATIVE ADVERSARIAL NETWORKS CONDITIONED ON ACOUSTIC FEATURES
Modern speech content creation tasks such as podcasts, video voice-overs, and audio books require studio-quality audio with full bandwidth and balanced equalization (EQ). These goals pose a challenge…
On The Compensation Between Magnitude and Phase in Speech Separation
TLDR
Analytical results based on monaural speech separation and robust automatic speech recognition (ASR) tasks in noisy-reverberant conditions support the validity of the novel view of implicit compensation between estimated magnitude and phase.
Two Heads are Better Than One: A Two-Stage Complex Spectral Mapping Approach for Monaural Speech Enhancement
TLDR
A novel complex spectral mapping approach with a two-stage pipeline for monaural speech enhancement in the time-frequency domain that aims to decouple the original problem into multiple sub-problems, achieving state-of-the-art performance over previous advanced systems under various conditions.
A Modulation-Domain Loss for Neural-Network-Based Real-Time Speech Enhancement
  • Tyler Vuong, Yangyang Xia, Richard M. Stern
  • Engineering, Computer Science
  • ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2021
TLDR
Experiments showed that adding the modulation-domain MSE to the MSE in the spectro-temporal domain substantially improved the objective prediction of speech quality and intelligibility for real-time speech enhancement systems without incurring additional computation during inference.
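As a rough, hedged illustration of what a modulation-domain MSE can look like (the authors' exact formulation likely differs, e.g., in filterbanks and windows): take the log-magnitude spectrogram, apply a second short-time transform along the time axis of each band to reach the modulation domain, and compare clean versus enhanced there.

```python
# Illustrative modulation-domain MSE (assumed STFT settings, not the paper's).
import numpy as np
from scipy.signal import stft

def modulation_mse(clean: np.ndarray, enhanced: np.ndarray, fs: int = 16000) -> float:
    def mod_spectrum(x):
        _, _, Z = stft(x, fs=fs, nperseg=512)            # (freq, frames)
        logmag = np.log1p(np.abs(Z))
        # Second STFT over the frame axis of each band -> modulation spectrum.
        # Frame rate of the first STFT is fs / hop; hop = 256 by default here.
        _, _, M = stft(logmag, fs=fs / 256, nperseg=32, axis=-1)
        return np.abs(M)
    mc, me = mod_spectrum(clean), mod_spectrum(enhanced)
    return float(np.mean((mc - me) ** 2))
```

In training, this term would be added to the spectro-temporal MSE; nothing changes at inference, which is why the loss adds no runtime cost.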
Enhancing into the Codec: Noise Robust Speech Coding with Vector-Quantized Autoencoders
Audio codecs based on discretized neural autoencoders have recently been developed and shown to provide significantly higher compression levels for comparable-quality speech output. However, these…
Monaural Speech Enhancement with Complex Convolutional Block Attention Module and Joint Time Frequency Losses
TLDR
The CCBAM is a lightweight and general module that can be easily integrated into any complex-valued convolutional layer, and a mixed loss function is proposed to jointly optimize the complex models in both the time-frequency (TF) and time domains.
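A minimal sketch of a joint time-frequency plus time-domain loss of the kind the mixed loss describes; the weighting alpha and STFT settings are assumptions for illustration, not the paper's values.

```python
# Joint TF-domain + time-domain MSE (illustrative settings).
import numpy as np
from scipy.signal import stft

def joint_tf_time_loss(clean: np.ndarray, enhanced: np.ndarray,
                       fs: int = 16000, alpha: float = 0.5) -> float:
    _, _, Zc = stft(clean, fs=fs, nperseg=512)
    _, _, Ze = stft(enhanced, fs=fs, nperseg=512)
    tf_term = np.mean(np.abs(Zc - Ze) ** 2)        # complex spectral MSE
    time_term = np.mean((clean - enhanced) ** 2)   # waveform MSE
    return float(alpha * tf_term + (1 - alpha) * time_term)
```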

References

Showing 1-10 of 28 references.
Speech Denoising with Deep Feature Losses
TLDR
An end-to-end deep learning approach to denoising speech signals by processing the raw waveform directly, which outperforms the state-of-the-art in objective speech quality metrics and in large-scale perceptual experiments with human listeners.
Speech Enhancement with LSTM Recurrent Neural Networks and its Application to Noise-Robust ASR
TLDR
It is demonstrated that LSTM speech enhancement, even when used 'naively' as front-end processing, delivers competitive results on the CHiME-2 speech recognition task.
A Regression Approach to Speech Enhancement Based on Deep Neural Networks
TLDR
The proposed DNN approach can effectively suppress highly nonstationary noise, which is difficult to handle in general, and deals well with noisy speech data recorded in real-world scenarios without generating the annoying musical artifacts commonly observed in conventional enhancement methods.
Learning Spectral Mapping for Speech Dereverberation and Denoising
TLDR
Deep neural networks are trained to directly learn a spectral mapping from the magnitude spectrogram of corrupted speech to that of clean speech, which substantially attenuates the distortion caused by reverberation, as well as background noise, and is conceptually simple.
The INTERSPEECH 2020 Deep Noise Suppression Challenge: Datasets, Subjective Speech Quality and Testing Framework
TLDR
A large clean speech and noise corpus is opened for training noise suppression models, along with a test set representative of real-world scenarios, consisting of both synthetic and real recordings, and an online subjective test framework based on ITU-T P.808 for researchers to quickly test their developments.
Conv-TasNet: Surpassing Ideal Time–Frequency Magnitude Masking for Speech Separation
  • Yi Luo, N. Mesgarani
  • Computer Science, Medicine
  • IEEE/ACM Transactions on Audio, Speech, and Language Processing
  • 2019
TLDR
A fully convolutional time-domain audio separation network (Conv-TasNet), a deep learning framework for end-to-end time-domain speech separation, which significantly outperforms previous time–frequency masking methods in separating two- and three-speaker mixtures.
Attention Wave-U-Net for Speech Enhancement
TLDR
It is found that the inclusion of the attention mechanism significantly improves the model's performance on objective speech quality metrics, and that the resulting model outperforms all other published speech enhancement approaches on the Voice Bank (VCTK) corpus.
Improved Speech Enhancement with the Wave-U-Net
TLDR
The Wave-U-Net architecture, a model introduced by Stoller et al. for the separation of music vocals and accompaniment, is studied; a reduced number of hidden layers is found to be sufficient for speech enhancement, compared to the original system designed for singing voice separation in music.
A Fully Convolutional Neural Network for Speech Enhancement
TLDR
The proposed network, Redundant Convolutional Encoder-Decoder (R-CED), demonstrates that a convolutional network can be 12 times smaller than a recurrent network and yet achieve better performance, which shows its applicability to embedded systems such as hearing aids.
Channel-Attention Dense U-Net for Multichannel Speech Enhancement
TLDR
This paper proposes Channel-Attention Dense U-Net, in which the channel-attention unit is applied recursively on feature maps at every layer of the network, enabling the network to perform non-linear beamforming.
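As a generic, hedged sketch of a channel-attention unit of the kind this reference applies at every layer (a squeeze-and-excitation style gate; the paper's exact unit may differ):

```python
# Generic channel-attention gate (illustrative, not the paper's exact unit).
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, freq, time); reweight each channel by a
        # gate computed from its global average over freq and time.
        w = self.fc(x.mean(dim=(2, 3)))   # (batch, channels)
        return x * w[:, :, None, None]
```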