Estimating nonnegative matrix model activations with deep neural networks to increase perceptual speech quality.

@article{Williamson2015EstimatingNM,
  title={Estimating nonnegative matrix model activations with deep neural networks to increase perceptual speech quality},
  author={Donald S. Williamson and Yuxuan Wang and Deliang Wang},
  journal={The Journal of the Acoustical Society of America},
  year={2015},
  volume={138},
  number={3},
  pages={1399--1407}
}
As a means of speech separation, time-frequency masking applies a gain function to the time-frequency representation of noisy speech. On the other hand, nonnegative matrix factorization (NMF) addresses separation by linearly combining basis vectors from speech and noise models to approximate noisy speech. This paper presents an approach for improving the perceptual quality of speech separated from background noise at low signal-to-noise ratios. An ideal ratio mask is estimated, which separates… 
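As a rough illustration of the gain-function view of time-frequency masking described in the abstract, here is a minimal NumPy sketch of computing and applying an ideal ratio mask; the spectrogram shapes and random toy data are assumptions for illustration only, not the paper's setup:

```python
import numpy as np

# Toy magnitude spectrograms (frequency bins x time frames); in a real
# system these would come from the STFT of clean speech and noise.
rng = np.random.default_rng(0)
speech = rng.random((257, 100))
noise = rng.random((257, 100))
mixture = speech + noise  # additive mixture (an approximation in the magnitude domain)

# Ideal ratio mask: per time-frequency unit, speech energy over total energy.
irm = speech**2 / (speech**2 + noise**2 + 1e-12)

# Masking applies the mask as a gain function to the noisy representation.
separated = irm * mixture
```

In a deployed system the mask is not ideal: a DNN estimates it from noisy features, and the masked magnitudes are resynthesized with the mixture phase.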


Deep Learning Methods for Improving the Perceptual Quality of Noisy and Reverberant Speech
This dissertation develops speech separation systems that combine T-F masking, DNNs, and model-based reconstruction to improve the perceptual quality of the separated speech.
Deep Learning Based Speech Separation via NMF-Style Reconstructions
The DNN directly optimizes an actual separation objective in the authors' system, so that accumulated errors are alleviated, and the proposed models are competitive with previous methods.
Ideal ratio mask estimation using supervised DNN approach for target speech signal enhancement
The SWEMD-VVMDH technique is extended with a Deep Neural Network (DNN) that efficiently learns the decomposed speech signals via SWEMH to achieve speech enhancement (SE); the experimental outcomes show considerable improvement under different categories of noise.
Complex Ratio Masking for Monaural Speech Separation
The proposed approach improves over other methods when evaluated with several objective metrics, including the perceptual evaluation of speech quality (PESQ), and a listening test where subjects prefer the proposed approach with at least a 69% rate.
An ideal quantized mask to increase intelligibility and quality of speech in noise.
As few as four to eight attenuation steps (IQM4, IQM8) improved intelligibility over the ideal binary mask (IBM), and equaled or exceeded speech processed by the IRM.
Impact of phase estimation on single-channel speech separation based on time-frequency masking.
The experiments demonstrate that replacing the mixture phase with the estimated clean spectral phase consistently improves perceptual speech quality, predicted speech intelligibility, and source separation performance across all signal-to-noise ratio and noise scenarios.
Deep neural networks for source separation and noise-robust speech recognition
This thesis builds upon the classical expectation-maximization (EM) based source separation framework employing a multichannel Gaussian model, in which the sources are characterized by their power spectral densities and their source spatial covariance matrices.
Improving Speech Intelligibility Through Speaker Dependent and Independent Spectral Style Conversion
The potential of conditional GANs (cGANs) to learn the mapping from habitual speech to clear speech is explored, and cGANs outperformed a traditional deep neural network mapping in terms of average keyword recall accuracy and the number of speakers with improved intelligibility.
Speech Intelligibility Potential of General and Specialized Deep Neural Network Based Speech Enhancement Systems
It is shown that DNN-based SE systems, when trained specifically to handle certain speakers, noise types and SNRs, are capable of achieving large improvements in estimated speech quality (SQ) and speech intelligibility (SI), when tested in matched conditions.

References

Showing 1-10 of 32 references
Deep neural networks for estimating speech model activations
This paper uses two stages of deep neural networks, where the first stage estimates the ideal ratio mask that separates speech from noise, and the second stage maps the ratio-masked speech to the clean speech activation matrices that are used for nonnegative matrix factorization (NMF).
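The second stage described above maps ratio-masked speech to NMF activation matrices. As a hedged sketch of the underlying idea of estimating activations against a fixed basis, the following uses standard Lee-Seung multiplicative updates rather than the authors' DNN-based estimator; all shapes and data are illustrative assumptions:

```python
import numpy as np

# Toy fixed speech basis (in practice W would be learned from clean speech)
# and a nonnegative magnitude spectrogram to approximate.
rng = np.random.default_rng(1)
W = rng.random((257, 40)) + 0.1   # fixed, nonnegative basis vectors
V = rng.random((257, 100)) + 0.1  # nonnegative spectrogram to model
H = rng.random((40, 100)) + 0.1   # activations to estimate

err0 = np.linalg.norm(V - W @ H)  # reconstruction error before updating

# Lee-Seung multiplicative updates for the Euclidean cost ||V - WH||^2,
# updating only H while W stays fixed; H stays nonnegative by construction.
for _ in range(200):
    H *= (W.T @ V) / (W.T @ W @ H + 1e-12)

err = np.linalg.norm(V - W @ H)   # error is non-increasing under these updates
```

The attraction of mapping to activations instead of spectra is that the reconstruction W @ H is constrained to the span of a clean-speech model, which tends to suppress residual noise artifacts.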
Reconstruction techniques for improving the perceptual quality of binary masked speech.
The results show that the proposed techniques improve the perceptual quality of binary masked speech, and outperform traditional time-frequency reconstruction approaches.
An Experimental Study on Speech Enhancement Based on Deep Neural Networks
This letter presents a regression-based speech enhancement framework using deep neural networks (DNNs) with a multi-layer deep architecture, which achieves significant improvements in terms of various objective quality measures.
Ideal ratio mask estimation using deep neural networks for robust speech recognition
  • A. Narayanan and Deliang Wang, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing
The proposed feature enhancement algorithm estimates a smoothed ideal ratio mask (IRM) in the Mel frequency domain using deep neural networks and a set of time-frequency unit level features that has previously been used to estimate the ideal binary mask.
An algorithm that improves speech intelligibility in noise for normal-hearing listeners.
The findings from this study suggest that algorithms that can estimate reliably the SNR in each T-F unit can improve speech intelligibility.
On Training Targets for Supervised Speech Separation
Results in various test conditions reveal that the two ratio mask targets, the IRM and the FFT-MASK, outperform the other targets in terms of objective intelligibility and quality metrics, and that masking-based targets are, in general, significantly better than spectral-envelope-based targets.
A two-stage approach for improving the perceptual quality of separated speech
This paper proposes a two-stage algorithm that uses a soft mask in the first stage for separation, and NMF in the second stage for improving perceptual quality where only a speech model needs to be trained.
Monaural Sound Source Separation by Nonnegative Matrix Factorization With Temporal Continuity and Sparseness Criteria
  • T. Virtanen, IEEE Transactions on Audio, Speech, and Language Processing, 2007
An unsupervised learning algorithm for the separation of sound sources in one-channel music signals is presented, which achieves better separation quality than previous algorithms.
An algorithm to improve speech recognition in noise for hearing-impaired listeners.
Testing using normal-hearing and HI listeners indicated that intelligibility increased following processing in all conditions, and increases were larger for HI listeners, for the modulated background, and for the least-favorable SNRs.
Boosted deep neural networks and multi-resolution cochleagram features for voice activity detection
A new VAD algorithm based on boosted deep neural networks (bDNNs) is described that outperforms state-of-the-art VADs by a considerable margin. It employs a new acoustic feature, the multi-resolution cochleagram (MRCG), which concatenates cochleagram features at multiple spectrotemporal resolutions and shows superior speech separation results over many other acoustic features.