• Corpus ID: 1017981

Speech Enhancement using a Deep Mixture of Experts

Shlomo E. Chazan, Jacob Goldberger, Sharon Gannot
In this study we present a Deep Mixture of Experts (DMoE) neural-network architecture for single-microphone speech enhancement. In contrast to most speech enhancement algorithms, which overlook the speech variability mainly caused by phoneme structure, our framework comprises a set of deep neural networks (DNNs), each of which is an 'expert' in enhancing a given speech type corresponding to a phoneme. A gating DNN determines which expert is assigned to a given speech segment. A speech…
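
The expert-plus-gating structure described above can be illustrated with a minimal sketch. It is not the authors' implementation: each "DNN" is reduced to a single affine layer with random weights, the helper names (`make_layer`, `linear`, `dmoe_enhance`) are hypothetical, and both the soft (weighted) and hard (single-expert) combination rules are assumptions, since the truncated abstract does not say which one is used.

```python
import math
import random


def softmax(scores):
    """Turn raw gating scores into probabilities that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def make_layer(n_out, n_in, rng):
    """Hypothetical stand-in for a trained DNN: one random affine layer."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    bias = [0.0] * n_out
    return weights, bias


def linear(layer, x):
    """Apply the affine layer (weights, bias) to the input vector x."""
    weights, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def dmoe_enhance(noisy_frame, experts, gate, hard=False):
    """Combine expert outputs for one noisy spectral frame.

    hard=False: gate-weighted average of all expert outputs (assumption).
    hard=True:  only the most probable expert is used (assumption).
    """
    gate_probs = softmax(linear(gate, noisy_frame))
    if hard:
        best = max(range(len(experts)), key=lambda i: gate_probs[i])
        gate_probs = [1.0 if i == best else 0.0 for i in range(len(experts))]
    outputs = [linear(expert, noisy_frame) for expert in experts]
    n_bins = len(outputs[0])
    enhanced = [
        sum(p * out[k] for p, out in zip(gate_probs, outputs))
        for k in range(n_bins)
    ]
    return enhanced, gate_probs


# Toy usage: 3 experts operating on a 4-bin frame (values are made up).
rng = random.Random(0)
n_bins, n_experts = 4, 3
experts = [make_layer(n_bins, n_bins, rng) for _ in range(n_experts)]
gate = make_layer(n_experts, n_bins, rng)
frame = [0.5, -1.0, 0.25, 2.0]
enhanced, probs = dmoe_enhance(frame, experts, gate)
hard_enhanced, hard_probs = dmoe_enhance(frame, experts, gate, hard=True)
```

In a real system each expert and the gate would be a trained deep network, and the input would be a feature vector derived from the noisy speech rather than a hand-written list.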

Deep recurrent mixture of experts for speech enhancement

A deep recurrent mixture of experts (DRMoE) architecture is proposed that addresses the large speech variability and the time-continuity of the speech signal by implementing the experts and the gating network as a recurrent neural network (RNN).

Speech Enhancement Using Deep Mixture of Experts Based on Hard Expectation Maximization

This work proposes a pre-training method for the individual DNNs in a deep mixture of experts: hard expectation maximization (EM) is used to pre-train the individual DNNs, after which a weighted combination of the outputs of the DNN experts is taken and the whole system is trained jointly.

Speech Enhancement Based on Deep Mixture of Distinguishing Experts

  • Xupeng Jia, Dongmei Li
  • Computer Science
    2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)
  • 2019
This work proposes using distinguishing deep neural networks (DNNs) as experts, dealing with the magnitude spectrogram and the log-magnitude spectrogram respectively, and compares the approach with the state-of-the-art DMoE system that uses the hard expectation maximization (HEM) pre-training method.

A Composite DNN Architecture for Speech Enhancement

This work shows that both separate cost functions are unsuitable for dealing with narrowband noise, and proposes a new composite estimator in the log-spectrum domain, which demonstrates superior performance for speech utterances contaminated by additive narrowband noise, while maintaining the enhancement quality of the baseline algorithms for wideband noise.

A Mixture of Expert Based Deep Neural Network for Improved ASR

A novel deep learning architecture for acoustic modeling in Automatic Speech Recognition (ASR), termed MixNet, uses two additional layers based on Mixture of Experts (MoE); this improves the separation between classes, which translates to better ASR accuracy.


A variant of the multiple deep neural network (DNN) based speech enhancement method that directly estimates the clean speech spectrum as a weighted average of outputs from multiple DNNs using a gating network.

Incorporating Symbolic Sequential Modeling for Speech Enhancement

It is argued that familiarity with the underlying linguistic content of spoken utterances benefits speech enhancement (SE) in noisy environments and the proposed framework can obtain notable performance improvement in terms of perceptual evaluation of speech quality and short-time objective intelligibility on the TIMIT dataset.

Test-Time Adaptation Toward Personalized Speech Enhancement: Zero-Shot Learning with Knowledge Distillation

  • Sunwoo Kim, Minje Kim
  • Computer Science
    2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)
  • 2021
A novel personalized speech enhancement method to adapt a compact denoising model to the test-time specificity, with the goal of utilizing no clean speech target of the test speaker, thus fulfilling the requirement for zero-shot learning.

English Spoken Digits Database under noise conditions for research: SDDN

  • A. Ouisaadane, S. Safi, M. Frikel
  • Computer Science
    2019 International Conference on Wireless Technologies, Embedded and Intelligent Systems (WITS)
  • 2019
A modified database for English spoken digits under all types of noise conditions (SDDN) is introduced, designed for use in scientific research, especially in the fields of speech enhancement, noise robustness, background noise, speech recognition, noise reduction, and signal processing.

Deep Graph Fusion Based Multimodal Evoked Expressions From Large-Scale Videos

A hybrid fusion model termed deep graph fusion for predicting viewers’ elicited expressions from videos by leveraging the combination of visual-audio representations and a semantic embedding loss to understand the semantic meaning of textual emotions in order to improve overall performance.



A Hybrid Approach for Speech Enhancement Using MoG Model and Neural Network Phoneme Classifier

A hybrid approach is proposed merging the generative mixture of Gaussians (MoG) model and the discriminative deep neural network (DNN) model, achieving a significant improvement over previous methods in terms of speech quality measures.

A phoneme-based pre-training approach for deep neural network with application to speech enhancement

A new phoneme-based deep neural network (DNN) framework for single-microphone speech enhancement that outperforms other schemes that either do not consider the phoneme structure or use a simpler training methodology.

Speech enhancement based on deep denoising autoencoder

Experimental results show that increasing the depth of the DAE consistently improves performance when a large training data set is given, and that, compared with a minimum mean square error based speech enhancement algorithm, the proposed DAE provides superior performance on the three objective evaluations.

Speech enhancement using a mixture-maximum model

A spectral domain, speech enhancement algorithm based on a mixture model for the short time spectrum of the clean speech signal, and on a maximum assumption in the production of the noisy speech spectrum that shows improved performance compared to alternative speech enhancement algorithms.

Phoneme-specific speech separation

Experiments on the corpus of the second CHiME speech separation and recognition challenge (task-2) demonstrate the effectiveness of this novel phoneme-specific speech separation method in terms of objective measures of speech intelligibility and quality, as well as recognition performance.

Speech recognition using noise-adaptive prototypes

A probabilistic mixture model is described for a frame (the short-term spectrum) of speech to be used in speech recognition; the model treats the energy in a frequency band as the larger of the separate energies of signal and noise in that band.

Single-Channel Speech Separation Using Soft Mask Filtering

The experimental results in terms of signal-to-noise ratio (SNR) and segmental SNR show that soft mask filtering outperforms binary mask and Wiener filtering.

Towards Scaling Up Classification-Based Speech Separation

This work proposes to learn more linearly separable and discriminative features from raw acoustic features and train linear SVMs, which are much easier and faster to train than kernel SVMs.

Deep Neural Network Based Supervised Speech Segregation Generalizes to Novel Noises through Large-scale Training

It is demonstrated that by training with a large number of different noises, the objective intelligibility results of DNN based supervised speech segregation on novel noises can match or even outperform those on trained noises.

Speech Enhancement Using a Multidimensional Mixture-Maximum Model

A single-microphone speech enhancement algorithm that models the log-spectrum of the noise-free speech signal by a multidimensional Gaussian mixture based on an earlier study which uses the single-dimensional mixture-maximum model for the speech signal.