Speaker-Independent Speech Separation With Deep Attractor Network

@article{Luo2018SpeakerIndependentSS,
  title={Speaker-Independent Speech Separation With Deep Attractor Network},
  author={Yi Luo and Zhuo Chen and Nima Mesgarani},
  journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing},
  year={2018},
  volume={26},
  pages={787--796}
}
Despite the recent success of deep learning for many speech processing tasks, single-microphone, speaker-independent speech separation remains challenging for two main reasons. The first reason is the arbitrary order of the target and masker speakers in the mixture (permutation problem), and the second is the unknown number of speakers in the mixture (output dimension problem). We propose a novel deep learning framework for speech separation that addresses both of these issues. We use a neural…
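The attractor mechanism the abstract alludes to can be sketched in a few lines of NumPy. This is a toy illustration under assumed shapes, not the paper's network: during training, the attractor for each source is the weighted mean of the embeddings of that source's time-frequency bins, and soft masks come from a softmax over bin-attractor similarities. The function name and array layout here are hypothetical.

```python
import numpy as np

def danet_masks(embeddings, source_assignments):
    """Toy sketch of the deep attractor idea (hypothetical shapes/names).

    embeddings: (TF, K) embedding vector per time-frequency bin.
    source_assignments: (TF, C) one-hot ideal assignments (training oracle).
    Returns soft masks of shape (TF, C).
    """
    # Attractor per source: weighted mean of the embeddings of its bins.
    weights = source_assignments.sum(axis=0, keepdims=True)            # (1, C)
    attractors = embeddings.T @ source_assignments / np.maximum(weights, 1e-8)  # (K, C)
    # Mask: softmax over the similarity between each bin and each attractor.
    scores = embeddings @ attractors                                   # (TF, C)
    scores -= scores.max(axis=1, keepdims=True)                        # numerical stability
    exp = np.exp(scores)
    return exp / exp.sum(axis=1, keepdims=True)
```

With well-separated embedding clusters, each bin's mask concentrates on the source it was assigned to, which is the behavior the framework relies on at training time.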
Online Deep Attractor Network for Real-time Single-channel Speech Separation
  • Cong Han, Yi Luo, N. Mesgarani
  • Computer Science
  • ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2019
TLDR: Experimental results show that ODANet can achieve separation accuracy similar to that of the noncausal DANet in both two-speaker and three-speaker speech separation problems, which makes it a suitable candidate for applications that require robust real-time speech processing.
A dual-stream deep attractor network with multi-domain learning for speech dereverberation and separation
TLDR: A dual-stream DAN with multi-domain learning is proposed to efficiently perform both dereverberation and separation tasks under the condition of variable numbers of speakers, achieving improvements in scale-invariant signal-to-distortion ratio (SI-SDR).
Exploring the time-domain deep attractor network with two-stream architectures in a reverberant environment
TLDR: This study proposes a time-domain deep attractor network (TD-DAN) with two-stream convolutional networks that efficiently performs both dereverberation and separation tasks under the condition of variable numbers of speakers.
Improved Source Counting and Separation for Monaural Mixture
TLDR: A novel model for single-channel multi-speaker separation is proposed that jointly learns the time-frequency feature and the unknown number of speakers, achieving state-of-the-art separation results on multi-speaker mixtures in terms of scale-invariant signal-to-noise ratio improvement (SI-SNRi) and signal-to-distortion ratio improvement (SDRi).
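Several of the works above report results as SI-SNRi/SDRi, so it may help to show the scale-invariant SNR these improvements are measured against. The sketch below uses the commonly cited definition (zero-mean signals, projection of the estimate onto the reference); it is a generic illustration, not code from any of the cited papers.

```python
import numpy as np

def si_snr(est, ref, eps=1e-8):
    """Scale-invariant signal-to-noise ratio in dB (common definition)."""
    ref = ref - ref.mean()
    est = est - est.mean()
    # Project the estimate onto the reference to factor out any rescaling.
    s_target = (np.dot(est, ref) / (np.dot(ref, ref) + eps)) * ref
    e_noise = est - s_target
    return 10.0 * np.log10((np.dot(s_target, s_target) + eps)
                           / (np.dot(e_noise, e_noise) + eps))
```

Because of the projection step, multiplying the estimate by any nonzero constant leaves the score unchanged, which is exactly the property that makes the metric robust to the unknown output scale of separation networks.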
Conv-TasNet: Surpassing Ideal Time–Frequency Magnitude Masking for Speech Separation
  • Yi Luo, N. Mesgarani
  • Computer Science, Medicine
  • IEEE/ACM Transactions on Audio, Speech, and Language Processing
  • 2019
TLDR: A fully convolutional time-domain audio separation network (Conv-TasNet), a deep learning framework for end-to-end time-domain speech separation, which significantly outperforms previous time-frequency masking methods in separating two- and three-speaker mixtures.
Speaker Attractor Network: Generalizing Speech Separation to Unseen Numbers of Sources
TLDR: Experimental results show that the proposed method significantly improves separation performance when generalizing to an unseen number of speakers, and can separate up to five speakers even when the model is trained only on two-speaker mixtures.
Guided Training: A Simple Method for Single-channel Speaker Separation
TLDR: This paper proposes a simple strategy for training a long short-term memory (LSTM) model to solve the permutation problem in speaker separation: a short utterance from the target speaker is inserted at the beginning of the mixture as guide information.
Single-channel speech separation using Soft-minimum Permutation Invariant Training
  • Midia Yousefi, John H.L. Hansen
  • Computer Science, Engineering
  • ArXiv
  • 2021
TLDR: A probabilistic optimization framework is proposed to address the inefficiency of PIT in finding the best output-label assignment, and is employed on the same long short-term memory (LSTM) architecture used in the permutation invariant training (PIT) speech separation method.
Multi-channel Speech Separation Using Deep Embedding Model with Multilayer Bootstrap Networks
TLDR: A variant of DPCL, named DPCL++, is proposed that applies a recent unsupervised deep learning method, multilayer bootstrap networks (MBN), to further reduce the noise and small variations of the embedding vectors in an unsupervised way at test time, which helps k-means produce a good result.
Deep Attractor Networks for Speaker Re-Identification and Blind Source Separation
TLDR: This model structure improves the signal-to-distortion ratio (SDR) over a DAN baseline and provides up to 61% and up to 34% relative reductions in permutation error rate and re-identification error rate, respectively, compared to an i-vector baseline.

References

SHOWING 1-10 OF 66 REFERENCES
Deep attractor network for single-microphone speaker separation
  • Zhuo Chen, Yi Luo, N. Mesgarani
  • Computer Science, Medicine
  • 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2017
TLDR: A novel deep learning framework for single-channel speech separation that creates attractor points in a high-dimensional embedding space of the acoustic signals, which pull together the time-frequency bins corresponding to each source.
A Regression Approach to Single-Channel Speech Separation Via High-Resolution Deep Neural Networks
We propose a novel data-driven approach to single-channel speech separation based on deep neural networks (DNNs) to directly model the highly nonlinear relationship between speech features of a mixed…
A Deep Ensemble Learning Method for Monaural Speech Separation
TLDR: A deep ensemble method, named multicontext networks, is proposed to address monaural speech separation; predicting the ideal time-frequency mask is found to be more efficient in utilizing clean training speech, while predicting clean speech is less sensitive to SNR variations.
Permutation invariant training of deep models for speaker-independent multi-talker speech separation
TLDR: This work proposes a novel deep learning training criterion, named permutation invariant training (PIT), for speaker-independent multi-talker speech separation, and finds that it compares favorably to non-negative matrix factorization (NMF), computational auditory scene analysis (CASA), and DPCL, and generalizes well to unseen speakers and languages.
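The PIT criterion described here can be sketched directly: evaluate the training loss under every possible output-to-target pairing and keep the smallest. Below is a minimal NumPy version using MSE as the per-pair loss; the function name and the exhaustive search over permutations are illustrative of the idea, not a specific paper's implementation.

```python
import itertools
import numpy as np

def pit_mse(estimates, targets):
    """Permutation-invariant MSE: try every output-target pairing, keep the best.

    estimates, targets: (C, T) arrays holding C separated signals.
    Returns (min_loss, best_permutation), where best_permutation[i] is the
    index of the estimate matched to targets[i].
    """
    C = estimates.shape[0]
    best = None
    for perm in itertools.permutations(range(C)):
        # Reorder the estimates according to this candidate assignment.
        loss = np.mean((estimates[list(perm)] - targets) ** 2)
        if best is None or loss < best[0]:
            best = (loss, perm)
    return best
```

The factorial cost of the search is why it is only practical for small speaker counts, which matches the two- and three-speaker settings studied in these papers.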
Single-Channel Multi-Speaker Separation Using Deep Clustering
TLDR: This paper significantly improves upon the baseline system performance by incorporating better regularization, larger temporal context, and a deeper architecture, culminating in an overall improvement in signal-to-distortion ratio (SDR) of 10.3 dB compared to the baseline, and produces unprecedented performance on a challenging speech separation task.
Deep clustering and conventional networks for music separation: Stronger together
TLDR: It is shown that deep clustering outperforms conventional networks on a singing voice separation task, in both matched and mismatched conditions, even though conventional networks have the advantage of end-to-end training for best signal approximation.
Deep clustering: Discriminative embeddings for segmentation and separation
TLDR: Preliminary experiments on single-channel mixtures from multiple speakers show that a speaker-independent model trained on two-speaker mixtures can improve signal quality for mixtures of held-out speakers by an average of 6 dB, and the same model does surprisingly well on three-speaker mixtures.
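The deep clustering objective behind this line of work is the affinity loss ||VVᵀ − YYᵀ||²_F between the embedding affinity matrix and the label affinity matrix. A small sketch of the standard low-rank expansion follows (assumed shapes; illustrative code, not taken from the paper):

```python
import numpy as np

def dpcl_loss(V, Y):
    """Deep clustering affinity loss ||V V^T - Y Y^T||_F^2.

    V: (N, K) embeddings, one per time-frequency bin.
    Y: (N, C) one-hot source assignments.
    Expanded so that only K x K, K x C, and C x C matrices are formed,
    avoiding the N x N affinity matrices for large N.
    """
    return (np.linalg.norm(V.T @ V) ** 2
            - 2.0 * np.linalg.norm(V.T @ Y) ** 2
            + np.linalg.norm(Y.T @ Y) ** 2)
```

The expansion follows from tr((VVᵀ − YYᵀ)²) = ||VᵀV||²_F − 2||VᵀY||²_F + ||YᵀY||²_F, and it is what makes the loss tractable when N (the number of time-frequency bins) is large.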
Multitalker Speech Separation With Utterance-Level Permutation Invariant Training of Deep Recurrent Neural Networks
In this paper, we propose the utterance-level permutation invariant training (uPIT) technique. uPIT is a practically applicable, end-to-end, deep-learning-based solution for speaker-independent…
Discriminatively trained recurrent neural networks for single-channel speech separation
TLDR: The results confirm the importance of fine-tuning the feature representation for DNN training and show consistent improvements from discriminative training, with long short-term memory recurrent DNNs obtaining the overall best results.
Joint Optimization of Masks and Deep Recurrent Neural Networks for Monaural Source Separation
TLDR: Joint optimization of masking functions and deep recurrent neural networks is explored for monaural source separation tasks, including speech separation, singing voice separation, and speech denoising, along with a discriminative criterion for training neural networks to further enhance separation performance.