Human Listening and Live Captioning: Multi-Task Training for Speech Enhancement

Sefik Emre Eskimez, Xiaofei Wang, Min Tang, Hemin Yang, Zirun Zhu, Zhuo Chen, Huaming Wang, Takuya Yoshioka

With the surge of online meetings, it has become more critical than ever to provide high-quality speech audio and live captioning under various noise conditions. However, most monaural speech enhancement (SE) models introduce processing artifacts and thus degrade the performance of downstream tasks, including automatic speech recognition (ASR). This paper proposes a multi-task training framework to make the SE models unharmful to ASR. Because most ASR training samples do not have corresponding… 

ESPnet-SE++: Speech Enhancement for Robust Speech Recognition, Translation, and Understanding

Results show that the integration of SE front-ends with back-end tasks is a promising research direction even for tasks besides ASR, especially in the multi-channel scenario.

Personalized Speech Enhancement: New Models and Comprehensive Evaluation

The results show that the proposed models yield better speech recognition accuracy, speech intelligibility, and perceptual quality than the baseline models, and that multi-task training can alleviate the target speaker over-suppression (TSOS) issue in addition to improving speech recognition accuracy.

Effect of Noise Suppression Losses on Speech Distortion and ASR Performance

  • Sebastian Braun, H. Gamper
  • Computer Science
    ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2022
This work sheds light on the success of the spectral complex compressed mean squared error (MSE) loss, and how its magnitude and phase-aware terms are related to the speech distortion vs. noise reduction trade off.
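
To make the magnitude and phase-aware terms mentioned above concrete, here is a minimal Python sketch of a compressed spectral loss of this kind. It is an illustration under stated assumptions, not the authors' implementation: the function name `compressed_spectral_mse`, the compression exponent `c = 0.3`, the weight `lam = 0.3`, and averaging over time-frequency bins are all choices made here for the example.

```python
import numpy as np


def compressed_spectral_mse(S_ref, S_est, c=0.3, lam=0.3):
    """Blend of a magnitude-only and a phase-aware compressed MSE.

    S_ref, S_est: complex STFT arrays of identical shape.
    c:   magnitude compression exponent (assumed value).
    lam: weight of the phase-aware term (assumed value).
    """
    # Power-law compressed magnitudes.
    mag_ref = np.abs(S_ref) ** c
    mag_est = np.abs(S_est) ** c

    # Unit phasors carrying only the phase of each bin.
    ph_ref = np.exp(1j * np.angle(S_ref))
    ph_est = np.exp(1j * np.angle(S_est))

    # Magnitude term: ignores phase entirely.
    mag_term = np.mean((mag_ref - mag_est) ** 2)

    # Phase-aware term: compares compressed magnitudes with phase reattached.
    cplx_term = np.mean(np.abs(mag_ref * ph_ref - mag_est * ph_est) ** 2)

    return (1.0 - lam) * mag_term + lam * cplx_term
```

With `lam = 0` only magnitudes are compared, so an estimate with the correct magnitude but flipped phase incurs no loss; with `lam = 1` the same estimate is penalized. Moving `lam` between these extremes is one way to trade off the two terms.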

SNRi Target Training for Joint Speech Enhancement and Recognition

This study proposes “signal-to-noise ratio improvement (SNRi) target training”, in which the SE frontend is trained to output a signal whose SNRi is controlled by an auxiliary scalar input, and observes that the jointly trained network automatically controls the target SNRi according to noise characteristics.

Leveraging Real Conversational Data for Multi-Channel Continuous Speech Separation

A three-stage training scheme for the CSS model is proposed that can leverage both supervised data and extra large-scale unsupervised real-world conversational data; it is applied to an array-geometry-agnostic CSS model, which can use the multi-channel data collected from any microphone array.

Improving Noise Robustness of Contrastive Speech Representation Learning with Speech Reconstruction

  • Heming Wang, Yao Qian, Deliang Wang
  • Computer Science
    ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2022
This work proposes to combine a reconstruction module with contrastive learning and perform multi-task continual pre-training on noisy data to improve the noise robustness of the learned representation; the reconstruction module serves only as an auxiliary training task and thus is not required during inference.

Fast Real-time Personalized Speech Enhancement: End-to-End Enhancement Network (E3Net) and Knowledge Distillation

An end-to-end enhancement (E3Net) model architecture is proposed, which is 3× faster than a baseline STFT-based model, and KD techniques are used to develop compressed student models without significantly degrading quality.

End-to-End Dereverberation, Beamforming, and Speech Recognition in a Cocktail Party

Experimental results show that the fully E2E ASR model can achieve competitive performance on both noisy and reverberant conditions, with over 30% relative word error rate (WER) reduction over the single-channel baseline systems.

One Model to Enhance Them All: Array Geometry Agnostic Multi-Channel Personalized Speech Enhancement

A new causal array-geometry-agnostic multi-channel PSE model is proposed, which can generate a high-quality enhanced signal from arbitrary microphone geometry and outperforms the model trained on a specific microphone array geometry in both speech quality and automatic speech recognition accuracy.

Adversarial Multi-Task Learning for Disentangling Timbre and Pitch in Singing Voice Synthesis

This study proposes a singing voice synthesis model with multi-task learning to use both approaches, acoustic features for a parametric vocoder and mel-spectrograms for a neural vocoder, to improve the quality of singing voices in a multi-singer model.



Improving Noise Robust Automatic Speech Recognition with Single-Channel Time-Domain Enhancement Network

It is shown that a single-channel time-domain denoising approach can significantly improve ASR performance, providing more than 30% relative word error reduction over a strong ASR back-end on the real evaluation data of the single-channel track of the CHiME-4 dataset.

Improving speech recognition in reverberation using a room-aware deep neural network and multi-task learning

Two approaches to improve deep neural network (DNN) acoustic models for speech recognition in reverberant environments are proposed, each using a parameterization of the reverberant environment extracted from the observed signal to train a room-aware DNN.

Exploring End-to-End Multi-Channel ASR with Bias Information for Meeting Transcription

This work investigates the joint modeling of a mask-based beamformer and Attention-Encoder-Decoder-based ASR and proposes an effective location bias integration method called deep concatenation for the beamformer network, which achieves a substantial word error rate reduction.

Frequency Domain Multi-channel Acoustic Modeling for Distant Speech Recognition

New acoustic modeling techniques that optimize spatial filtering and long short-term memory (LSTM) layers from multi-channel (MC) input based on an ASR criterion directly are developed and incorporated into the acoustic model.

Phase-aware Speech Enhancement with Deep Complex U-Net

A novel loss function, the weighted source-to-distortion ratio (wSDR) loss, is proposed, which is designed to directly correlate with a quantitative evaluation measure and achieves state-of-the-art performance in all metrics.
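
As a rough illustration of how a weighted SDR-style loss can be computed, the sketch below follows the commonly cited form: a negative cosine similarity on the speech estimate and on the implied noise estimate, mixed by the relative energy of the clean speech. This is a sketch under assumptions, not the paper's reference code; the function names `neg_cosine` and `wsdr_loss` and the stabilizer `eps` are hypothetical.

```python
import numpy as np


def neg_cosine(a, b, eps=1e-8):
    """Negative cosine similarity between two 1-D signals, in [-1, 1]."""
    return -np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps)


def wsdr_loss(x, y, y_hat, eps=1e-8):
    """Weighted SDR-style loss.

    x:     noisy mixture (time-domain samples)
    y:     clean speech target
    y_hat: enhanced speech estimate
    """
    # Noise is whatever remains after removing speech from the mixture.
    z = x - y
    z_hat = x - y_hat

    # Energy-based weight: emphasize the speech term when speech dominates.
    e_y = np.sum(y ** 2)
    e_z = np.sum(z ** 2)
    alpha = e_y / (e_y + e_z + eps)

    return alpha * neg_cosine(y, y_hat) + (1.0 - alpha) * neg_cosine(z, z_hat)
```

A perfect estimate (`y_hat` equal to `y`) drives both cosine terms to their minimum, so the loss approaches -1; any other estimate gives a value between -1 and 1.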

Intrusive and Non-Intrusive Perceptual Speech Quality Assessment Using a Convolutional Neural Network

A convolutional neural network is proposed to predict the perceived quality of speech with noise, reverberation, and distortions, both intrusively and non-intrusively, i.e., with and without a clean reference signal.

PoCoNet: Better Speech Enhancement with Frequency-Positional Embeddings, Semi-Supervised Conversational Data, and Biased Loss

The novel PoCoNet architecture is a convolutional neural network that is able to more efficiently build frequency-dependent features in the early layers, and a new loss function biased towards preserving speech quality helps the optimization better match human perceptual opinions on speech quality.

Joint Time-Frequency and Time Domain Learning for Speech Enhancement

A cross-domain framework named TFTNet, which takes a time-frequency spectrogram as input and produces a time-domain waveform as output, is presented; it achieves the highest SDR and SSNR among state-of-the-art methods on two major speech enhancement benchmarks.

Towards Efficient Models for Real-Time Deep Noise Suppression

This work investigates reasonably small recurrent and convolutional-recurrent network architectures for speech enhancement, trained on a large dataset considering also reverberation, and shows interesting tradeoffs between computational complexity and the achievable speech quality, measured on real recordings using a highly accurate MOS estimator.

SEGAN: Speech Enhancement Generative Adversarial Network

This work proposes the use of generative adversarial networks for speech enhancement, operates at the waveform level, trains the model end-to-end, and incorporates 28 speakers and 40 different noise conditions into the same model, such that model parameters are shared across them.