SDR – Half-baked or Well Done?

@inproceedings{LeRoux2019SDRH,
  title={{SDR} -- Half-baked or Well Done?},
  author={Le Roux, Jonathan and Wisdom, Scott and Erdogan, Hakan and Hershey, John R.},
  booktitle={2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year={2019},
  pages={626--630}
}
In speech enhancement and source separation, signal-to-noise ratio is a ubiquitous objective measure of denoising/separation quality. A decade ago, the BSS_eval toolkit was developed to give researchers worldwide a way to evaluate the quality of their algorithms in a simple, fair, and hopefully insightful way: it attempted to account for channel variations, and to not only evaluate the total distortion in the estimated signal but also split it in terms of various factors such as remaining… 
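The metric the paper advocates, scale-invariant SDR (SI-SDR), sidesteps BSS Eval's channel-variation handling by allowing only a single gain on the reference. A minimal NumPy sketch of that computation (the function name and `eps` guard are mine, not from the paper or any toolkit):

```python
import numpy as np

def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant SDR in dB: project the estimate onto the
    reference to find the optimal gain, then compare the energies of
    the scaled target and the residual. Means are removed first."""
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Optimal scaling of the reference toward the estimate.
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference
    noise = estimate - target
    return 10 * np.log10((np.dot(target, target) + eps)
                         / (np.dot(noise, noise) + eps))
```

Because the projection absorbs any global gain, rescaling the estimate leaves the score unchanged, which is the "scale-invariant" property the paper argues for.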

Citations

Phase-aware Single-stage Speech Denoising and Dereverberation with U-Net
TLDR
This work proposes a new masking method called phase-aware beta-sigmoid mask (PHM), which reuses the estimated magnitude values to estimate the clean phase by respecting the triangle inequality in the complex domain between three signal components such as mixture, source and the rest.
End-to-End Multi-Task Denoising for joint SDR and PESQ Optimization
TLDR
The experimental result showed that the proposed denoising scheme significantly improved both SDR and PESQ performance over the existing methods.
Two-stage model and optimal SI-SNR for monaural multi-speaker speech separation in noisy environment
TLDR
A two-stage model based on Conv-TasNet that deals with the notable effects of noise and interfering speakers separately, where enhancement and separation are conducted sequentially using deep dilated temporal convolutional networks (TCN), together with a new objective function named optimal scale-invariant signal-to-noise ratio (OSI-SNR), which outperforms the original SI-SNR in all circumstances.
Towards end-to-end speech enhancement with a variational U-Net architecture
TLDR
Experiments show that the residual (skip) connections in the proposed system are required for successful end-to-end signal enhancement, i.e., without filter mask estimation, and indicate a slight advantage of the variational U-Net architecture over its non-variational version in terms of signal enhancement performance under reverberant conditions.
Convolutive Transfer Function Invariant SDR Training Criteria for Multi-Channel Reverberant Speech Separation
TLDR
The proposed system approaches, on LibriSpeech-based reverberant mixtures, the error rate obtained on single-source non-reverberant input, thus outperforming a conventional permutation invariant training based system and alternative objectives like scale-invariant signal-to-distortion ratio by a large margin.
Multi-Task Learning for End-to-End Noise-Robust Bandwidth Extension
TLDR
An end-to-end time-domain framework for noise-robust bandwidth extension, that jointly optimizes a mask-based speech enhancement and an ideal bandwidth extension module with multi-task learning, is proposed.
End-to-End Speech Separation with Unfolded Iterative Phase Reconstruction
This paper proposes an end-to-end approach for single-channel speaker-independent multi-speaker speech separation, where time-frequency (T-F) masking, the short-time Fourier transform (STFT), and its…
Leveraging Low-Distortion Target Estimates for Improved Speech Enhancement
TLDR
A novel explanation from the perspective of the low-distortion nature of such algorithms is provided, and it is found that they can consistently improve phase estimation.
How Bad Are Artifacts?: Analyzing the Impact of Speech Enhancement Errors on ASR
TLDR
The causes of ASR performance degradation are investigated by decomposing the SE errors using orthogonal projection-based decomposition (OPD), and it is demonstrated that the simple observation adding (OA) technique can monotonically increase the signal-to-artifact ratio under a mild condition.
TENET: A Time-Reversal Enhancement Network for Noise-Robust ASR
TLDR
This study presents TENET, a novel Time-reversal Enhancement NETwork, which leverages the transformation of an input noisy signal itself, in conjunction with a Siamese network and a complex dual-path Transformer to promote SE performance for noise-robust ASR.
...
...

References

SHOWING 1-10 OF 26 REFERENCES
Performance Based Cost Functions for End-to-End Speech Separation
TLDR
Subjective listening tests reveal that combinations of the proposed cost functions help achieve superior separation performance as compared to stand-alone MSE and SDR costs.
End-to-End Speech Separation with Unfolded Iterative Phase Reconstruction
This paper proposes an end-to-end approach for single-channel speaker-independent multi-speaker speech separation, where time-frequency (T-F) masking, the short-time Fourier transform (STFT), and its…
TasNet: Surpassing Ideal Time-Frequency Masking for Speech Separation.
TaSNet: Time-Domain Audio Separation Network for Real-Time, Single-Channel Speech Separation
  • Yi Luo, N. Mesgarani
  • Computer Science
    2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2018
TLDR
Time-domain Audio Separation Network (TasNet) is proposed, which outperforms the current state-of-the-art causal and noncausal speech separation algorithms, reduces the computational cost of speech separation, and significantly reduces the minimum required latency of the output.
Subjective and Objective Quality Assessment of Audio Source Separation
TLDR
A family of objective measures aiming to predict subjective scores based on the decomposition of the estimation error into several distortion components and on the use of the PEMO-Q perceptual salience measure to provide multiple features that are then combined are proposed.
Performance measurement in blind audio source separation
TLDR
This paper considers four different sets of allowed distortions in blind audio source separation algorithms, from time-invariant gains to time-varying filters, and derives a global performance measure using an energy ratio, plus a separate performance measure for each error term.
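The energy-ratio measures above split the estimation error into interference and artifact parts via orthogonal projections. A simplified NumPy sketch of that idea, restricted to the gain-only allowed-distortion case (the full BSS Eval toolkit also permits time-varying filters; function names here are mine):

```python
import numpy as np

def project(x, basis):
    # Least-squares projection of x onto the span of the rows of `basis`.
    coeffs, *_ = np.linalg.lstsq(basis.T, x, rcond=None)
    return basis.T @ coeffs

def bss_eval_terms(estimate, target, interferers):
    """Decompose an estimate into a target part, an interference part
    (explained by the other sources), and an artifact part (explained
    by nothing), then report the corresponding energy ratios in dB."""
    s_target = project(estimate, target[None, :])
    p_all = project(estimate, np.vstack([target[None, :], interferers]))
    e_interf = p_all - s_target    # energy attributable to other sources
    e_artif = estimate - p_all     # residual not explained by any source
    sir = 10 * np.log10(np.sum(s_target**2) / np.sum(e_interf**2))
    sar = 10 * np.log10(np.sum(p_all**2) / np.sum(e_artif**2))
    return sir, sar
```

Each successive projection enlarges the explained subspace, so the decomposition is exact: the three terms sum back to the estimate.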
Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks
TLDR
A phase-sensitive objective function based on the signal-to-noise ratio (SNR) of the reconstructed signal is developed, and it is shown in experiments that it yields uniformly better results in terms of signal-to-distortion ratio (SDR).
Speech enhancement based on deep denoising autoencoder
TLDR
Experimental results show that adding depth to the DAE consistently increases performance when a large training data set is given; compared with a minimum mean square error based speech enhancement algorithm, the proposed denoising DAE provides superior performance on the three objective evaluations.
Single-Channel Multi-Speaker Separation Using Deep Clustering
TLDR
This paper significantly improves upon the baseline system performance by incorporating better regularization, larger temporal context, and a deeper architecture, culminating in an overall improvement in signal-to-distortion ratio (SDR) of 10.3 dB compared to the baseline, and produces unprecedented performance on a challenging speech separation task.
The 2018 Signal Separation Evaluation Campaign
TLDR
This year's edition of SiSEC was focused on audio and pursued the effort towards scaling up and making it easier to prototype audio separation software in an era of machine-learning based systems, including a new music separation database: MUSDB18.
...
...