• Corpus ID: 239050121

Time-Domain Mapping Based Single-Channel Speech Separation With Hierarchical Constraint Training

  • Chenyang Gao, Yue Gu, Ivan Marsic
  • Published 20 October 2021
  • Computer Science, Engineering
  • ArXiv
Single-channel speech separation is required for multi-speaker speech recognition. Recent deep learning-based approaches have focused on the time-domain audio separation network (TasNet) because it offers superior performance and lower latency than conventional time-frequency-based (T-F-based) approaches. Most of these works rely on the masking-based method, which estimates a linear mapping function (mask) for each speaker. However, the other commonly used method, the mapping-based method that is less… 
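The masking-vs-mapping distinction in the abstract can be sketched numerically. In the sketch below, random arrays stand in for the network outputs and encoder features; the shapes and the two-speaker setup are illustrative assumptions, not the paper's actual model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "encoder" features of a 2-speaker mixture: (frames, basis) matrix.
mixture_feats = rng.standard_normal((100, 64))

# Masking-based separation: the network predicts one multiplicative
# mask per speaker; each source estimate is mask * mixture features.
masks = rng.uniform(0.0, 1.0, size=(2, 100, 64))   # stand-in for network output
masked_estimates = masks * mixture_feats           # broadcasts over speakers

# Mapping-based separation: the network regresses each speaker's
# features directly, with no multiplicative constraint tying the
# estimate to the mixture.
mapped_estimates = rng.standard_normal((2, 100, 64))  # stand-in for network output

print(masked_estimates.shape, mapped_estimates.shape)
```

Both routes produce one feature tensor per speaker; they differ only in whether the estimate is constrained to be an element-wise reweighting of the mixture.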

References

TasNet: Time-Domain Audio Separation Network for Real-Time, Single-Channel Speech Separation
  • Yi Luo, N. Mesgarani
  • Computer Science, Engineering
    2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2018
Time-domain Audio Separation Network (TasNet) is proposed, which outperforms the current state-of-the-art causal and noncausal speech separation algorithms, reduces the computational cost of speech separation, and significantly reduces the minimum required latency of the output.
Time-Domain Mapping with Convolution Networks for End-to-End Monaural Speech Separation
Speech separation is a core problem in audio signal processing and a key pre-processing step for automatic speech recognition. The magnitude spectrogram is reported as the standard time-and-frequency…
Conv-TasNet: Surpassing Ideal Time–Frequency Magnitude Masking for Speech Separation
  • Yi Luo, N. Mesgarani
  • Computer Science, Medicine
    IEEE/ACM Transactions on Audio, Speech, and Language Processing
  • 2019
A fully convolutional time-domain audio separation network (Conv-TasNet), a deep learning framework for end-to-end time-domain speech separation, which significantly outperforms previous time–frequency masking methods in separating two- and three-speaker mixtures.
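The encoder/mask/decoder pipeline that Conv-TasNet uses can be sketched with random stand-in weights. This is shapes only: the real model uses learned 1-D convolutions, a temporal convolutional separator, and overlapping windows, none of which appear here:

```python
import numpy as np

rng = np.random.default_rng(0)

L, N, C = 16, 32, 2              # window length, basis size, speakers
mixture = rng.standard_normal(160)

# Encoder: non-overlapping frames projected onto a basis (random here,
# learned 1-D conv filters in the real model), with a ReLU.
frames = mixture.reshape(-1, L)              # (10, 16)
enc_basis = rng.standard_normal((L, N))
feats = np.maximum(frames @ enc_basis, 0.0)  # (10, 32)

# Separator: one mask per speaker over the encoder features.
masks = rng.uniform(size=(C,) + feats.shape)
src_feats = masks * feats                    # (2, 10, 32)

# Decoder: project back to waveform frames and reassemble (trivially,
# since the toy frames do not overlap).
dec_basis = rng.standard_normal((N, L))
sources = (src_feats @ dec_basis).reshape(C, -1)  # (2, 160)
print(sources.shape)
```

The key point of the time-domain design is that both bases are learned jointly with the separator instead of being fixed like an STFT.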
Permutation invariant training of deep models for speaker-independent multi-talker speech separation
This work proposes a novel deep learning training criterion, named permutation invariant training (PIT), for speaker independent multi-talker speech separation, and finds that it compares favorably to non-negative matrix factorization (NMF), computational auditory scene analysis (CASA), and DPCL and generalizes well over unseen speakers and languages.
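The PIT criterion summarized above can be sketched as a minimal loss function. The MSE criterion and the toy signals below are assumptions for illustration; the core idea is only that every speaker assignment is scored and the cheapest one is kept:

```python
import itertools
import numpy as np

def pit_mse(estimates, targets):
    """Permutation invariant MSE: try every speaker assignment and
    keep the cheapest one, so the label order does not matter."""
    n = len(estimates)
    best_loss, best_perm = float("inf"), None
    for perm in itertools.permutations(range(n)):
        loss = np.mean([np.mean((estimates[i] - targets[p]) ** 2)
                        for i, p in zip(range(n), perm)])
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm

s1 = np.ones(8)
s2 = np.zeros(8)
# The network outputs the two sources in swapped order; PIT still
# finds the zero-loss assignment.
loss, perm = pit_mse(np.stack([s2, s1]), np.stack([s1, s2]))
print(loss, perm)  # 0.0 under the swapped assignment (1, 0)
```

Exhaustive search over permutations is factorial in the number of speakers, which is why PIT is typically used for small, fixed speaker counts.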
TransMask: A Compact and Fast Speech Separation Model Based on Transformer
TransMask fully exploits parallelism during inference, achieves nearly linear inference time for reasonable input audio lengths, and outperforms existing solutions on output speech audio quality, achieving an SDR above 16 on the LibriMix benchmark.
A Deep Ensemble Learning Method for Monaural Speech Separation
A deep ensemble method, named multicontext networks, is proposed to address monaural speech separation and it is found that predicting the ideal time-frequency mask is more efficient in utilizing clean training speech, while predicting clean speech is less sensitive to SNR variations.
Dual-Path RNN: Efficient Long Sequence Modeling for Time-Domain Single-Channel Speech Separation
  • Yi Luo, Zhuo Chen, T. Yoshioka
  • Computer Science, Engineering
    ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2020
Experiments show that by replacing the 1-D CNN with DPRNN and applying sample-level modeling in the time-domain audio separation network (TasNet), a new state-of-the-art performance on WSJ0-2mix is achieved with a model 20 times smaller than the previous best system.
A Regression Approach to Speech Enhancement Based on Deep Neural Networks
The proposed DNN approach can well suppress highly nonstationary noise, which is tough to handle in general, and is effective in dealing with noisy speech data recorded in real-world scenarios without the generation of the annoying musical artifact commonly observed in conventional enhancement methods.
Multitalker Speech Separation With Utterance-Level Permutation Invariant Training of Deep Recurrent Neural Networks
In this paper, we propose the utterance-level permutation invariant training (uPIT) technique. uPIT is a practically applicable, end-to-end, deep-learning-based solution for speaker-independent…
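A minimal sketch of the utterance-level variant, again assuming an MSE criterion: the permutation is solved once over all frames of the utterance, rather than independently frame by frame as in frame-level PIT, which keeps each output stream assigned to one speaker for the whole utterance:

```python
import itertools
import numpy as np

def upit_mse(estimates, targets):
    """Utterance-level PIT: score each speaker permutation over ALL
    frames at once and keep the single best assignment."""
    n = estimates.shape[0]
    losses = []
    for perm in itertools.permutations(range(n)):
        losses.append(np.mean([(estimates[i] - targets[p]) ** 2
                               for i, p in zip(range(n), perm)]))
    return min(losses)

# Toy utterances: (speakers, frames, features), output order swapped.
est = np.stack([np.zeros((5, 4)), np.ones((5, 4))])
tgt = np.stack([np.ones((5, 4)), np.zeros((5, 4))])
print(upit_mse(est, tgt))  # 0.0: one permutation fits every frame
```

Fixing one permutation per utterance avoids the speaker-tracing problem that frame-level assignments create at inference time.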
An Experimental Study on Speech Enhancement Based on Deep Neural Networks
This letter presents a regression-based speech enhancement framework using deep neural networks (DNNs) with a multiple-layer deep architecture that tends to achieve significant improvements in terms of various objective quality measures.