Deep attractor network for single-microphone speaker separation

@article{Chen2017DeepAN,
  title={Deep attractor network for single-microphone speaker separation},
  author={Zhuo Chen and Yi Luo and Nima Mesgarani},
  journal={2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year={2017},
  pages={246--250}
}
  • Zhuo Chen, Yi Luo, N. Mesgarani
  • Published 27 November 2016
  • Computer Science
  • 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Despite the overwhelming success of deep learning in various speech processing tasks, the problem of separating simultaneous speakers in a mixture remains challenging. [...] Key Method: Attractor points in this study are created by finding the centroids of the sources in the embedding space, which are subsequently used to determine the similarity of each bin in the mixture to each source. The network is then trained to minimize the reconstruction error of each source by optimizing the embeddings.
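
To make the Key Method concrete, the following is a minimal PyTorch sketch of the attractor computation and mask estimation described above, assuming embeddings V and the ideal bin-to-source assignments Y available during training; all names and shapes here are illustrative, not the authors' code.

import torch

def danet_masks(V, Y):
    # V: (TF, K) embeddings for each time-frequency bin.
    # Y: (TF, C) one-hot ideal assignment of each bin to one of C sources.
    # Attractor of each source = centroid of the embeddings assigned to it.
    A = (Y.t() @ V) / Y.sum(dim=0, keepdim=True).t().clamp(min=1e-8)  # (C, K)
    # Similarity of every bin to every attractor; softmax over sources -> masks.
    return torch.softmax(V @ A.t(), dim=1)  # (TF, C)

def danet_loss(M, mix_mag, src_mag):
    # mix_mag: (TF,) mixture magnitudes; src_mag: (TF, C) clean source magnitudes.
    return ((src_mag - M * mix_mag.unsqueeze(1)) ** 2).mean()

Training backpropagates through the masks, so minimizing the reconstruction error shapes the embedding space until bins belonging to the same speaker gather around a common attractor.
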
Citations

Speaker-Independent Speech Separation With Deep Attractor Network
TLDR
This work proposes a deep learning framework for speech separation that uses a neural network to project the time-frequency representation of the mixture signal into a high-dimensional embedding space, presents three methods for finding the attractors for each source in that space, and compares their advantages and limitations.
Online Deep Attractor Network for Real-time Single-channel Speech Separation
  • Cong Han, Yi Luo, N. Mesgarani
  • Computer Science
  • ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2019
TLDR
Experimental results show that ODANet achieves separation accuracy similar to the noncausal DANet on both two-speaker and three-speaker separation problems, which makes it a suitable candidate for applications that require robust real-time speech processing.
Deep Extractor Network for Target Speaker Recovery From Single Channel Speech Mixtures
TLDR
A novel "deep extractor network" creates an extractor point for the target speaker in a canonical high-dimensional embedding space and pulls together the time-frequency bins corresponding to the target speaker.
Listening to Each Speaker One by One with Recurrent Selective Hearing Networks
TLDR
This paper casts source separation as a recursive multi-pass source extraction problem based on a recurrent neural network (RNN) that learns to determine how many computational steps/iterations must be performed depending on the input signals.
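
The recursion can be pictured with a short sketch, assuming a hypothetical extractor network that returns one estimated source plus a stop probability per pass; the paper's architecture and stopping criterion differ in detail.

def recursive_extract(extractor, mixture, max_sources=5, stop_threshold=0.5):
    # `extractor` is a hypothetical network: residual -> (source, stop_prob).
    sources, residual = [], mixture
    for _ in range(max_sources):
        est, stop_prob = extractor(residual)  # extract one source per pass
        sources.append(est)
        residual = residual - est             # peel the source off the mixture
        if stop_prob > stop_threshold:        # learned "nothing left" signal
            break
    return sources
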
Speaker Attractor Network: Generalizing Speech Separation to Unseen Numbers of Sources
TLDR
Experimental results show that the proposed method significantly improves separation performance when generalizing to an unseen number of speakers, and can separate up to five speakers even when the model is trained only on two-speaker mixtures.
Cracking the cocktail party problem by multi-beam deep attractor network
  • Zhuo Chen, Jinyu Li, +4 authors Y. Gong
  • Computer Science, Engineering
  • 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)
  • 2017
TLDR
The proposed system largely improves the state of the art in speech separation, achieving an 11.02 dB average signal-to-distortion ratio improvement for 4-, 3-, and 2-speaker overlapped mixtures, which is comparable to the performance of a minimum variance distortionless response beamformer.
Deep Speech Denoising with Vector Space Projections
TLDR
An algorithm that denoises speech from a single microphone in the presence of non-stationary and dynamic noise by leveraging embedding spaces produced with source-contrastive estimation, a technique derived from negative sampling in natural language processing, while simultaneously obtaining a continuous inference mask.
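
The exact formulation of source-contrastive estimation is not given here; the following is a hedged sketch of a negative-sampling-style embedding loss in that spirit, where V, W, and Y are illustrative assumptions rather than the paper's notation.

import torch
import torch.nn.functional as F

def source_contrastive_loss(V, W, Y):
    # V: (TF, K) bin embeddings; W: (C, K) source vectors; Y: (TF,) labels.
    logits = V @ W.t()                                  # (TF, C) similarities
    pos = logits.gather(1, Y.unsqueeze(1)).squeeze(1)   # true-source score
    neg_mask = F.one_hot(Y, num_classes=W.size(0)).bool()
    neg = logits.masked_fill(neg_mask, float('-inf'))   # exclude the positive
    # Pull each bin toward its own source, push it away from all the others.
    return -(F.logsigmoid(pos).mean() + F.logsigmoid(-neg).mean())
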
Improved Source Counting and Separation for Monaural Mixture
TLDR
A novel model for single-channel multi-speaker separation that jointly learns the time-frequency features and the unknown number of speakers is proposed; it achieves state-of-the-art separation results on multi-speaker mixtures in terms of scale-invariant signal-to-noise ratio improvement (SI-SNRi) and signal-to-distortion ratio improvement (SDRi).
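
For reference, SI-SNR, the metric behind SI-SNRi, can be computed as follows; SI-SNRi is the SI-SNR of the separated signal minus that of the unprocessed mixture against the same reference.

import numpy as np

def si_snr(est, ref, eps=1e-8):
    # Zero-mean both signals so the measure is invariant to offsets and scale.
    est, ref = est - est.mean(), ref - ref.mean()
    # Project the estimate onto the reference to get the target component.
    s_target = (np.dot(est, ref) / (np.dot(ref, ref) + eps)) * ref
    e_noise = est - s_target
    return 10 * np.log10(np.dot(s_target, s_target) / (np.dot(e_noise, e_noise) + eps))
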
Deep Attractor Networks for Speaker Re-Identification and Blind Source Separation
TLDR
This model structure improves the signal-to-distortion ratio (SDR) over a DAN baseline and provides up to 61% and up to 34% relative reductions in permutation error rate and re-identification error rate, respectively, compared to an i-vector baseline.
Multi-channel Speech Separation Using Deep Embedding Model with Multilayer Bootstrap Networks
TLDR
A variant of DPCL, named DPCL++, is proposed by applying a recent unsupervised deep learning method, multilayer bootstrap networks (MBN), to further reduce the noise and small variations of the embedding vectors in an unsupervised way at test time, which makes it easier for k-means to produce a good result.
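
The test-stage recipe that DPCL and DPCL++ share, clustering the per-bin embeddings and turning the cluster labels into masks, can be sketched as below; the MBN denoising that DPCL++ inserts before clustering is omitted here.

import numpy as np
from sklearn.cluster import KMeans

def embeddings_to_masks(V, n_sources, tf_shape):
    # V: (TF, K) embeddings; tf_shape: (T, F) shape of the spectrogram.
    labels = KMeans(n_clusters=n_sources, n_init=10).fit_predict(V)  # (TF,)
    masks = [(labels == c).reshape(tf_shape) for c in range(n_sources)]
    return np.stack(masks).astype(np.float32)  # (n_sources, T, F) binary masks
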

References

Showing 1-10 of 24 references
Deep clustering: Discriminative embeddings for segmentation and separation
TLDR
Preliminary experiments on single-channel mixtures of multiple speakers show that a speaker-independent model trained on two-speaker mixtures can improve signal quality for mixtures of held-out speakers by an average of 6 dB, and the same model does surprisingly well on three-speaker mixtures.
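
The deep clustering objective is compact enough to state in code: it matches the embedding affinity matrix to the ideal assignment affinity, ||V V^T - Y Y^T||_F^2, expanded below so that the TF x TF matrices are never materialized.

import torch

def deep_clustering_loss(V, Y):
    # V: (TF, K) unit-norm embeddings; Y: (TF, C) one-hot bin-to-source labels.
    vtv, vty, yty = V.t() @ V, V.t() @ Y, Y.t() @ Y  # (K,K), (K,C), (C,C)
    # ||VV^T - YY^T||_F^2 = ||V^T V||_F^2 - 2 ||V^T Y||_F^2 + ||Y^T Y||_F^2
    return vtv.pow(2).sum() - 2 * vty.pow(2).sum() + yty.pow(2).sum()
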
Single-Channel Multi-Speaker Separation Using Deep Clustering
TLDR
This paper significantly improves upon the baseline system by incorporating better regularization, larger temporal context, and a deeper architecture, culminating in an overall improvement in signal-to-distortion ratio (SDR) of 10.3 dB compared to the baseline and unprecedented performance on a challenging speech separation task.
Permutation invariant training of deep models for speaker-independent multi-talker speech separation
TLDR
This work proposes a novel deep learning training criterion, named permutation invariant training (PIT), for speaker-independent multi-talker speech separation, and finds that it compares favorably to non-negative matrix factorization (NMF), computational auditory scene analysis (CASA), and DPCL, and generalizes well over unseen speakers and languages.
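
A minimal sketch of the PIT criterion: score the separation loss under every pairing of estimated and reference sources and train on the cheapest one (shown here at the utterance level; the original formulation also has a frame-level variant).

import itertools
import torch
import torch.nn.functional as F

def pit_loss(est, ref):
    # est, ref: (C, ...) tensors holding one source signal per row.
    C = ref.shape[0]
    best = None
    for perm in itertools.permutations(range(C)):
        loss = sum(F.mse_loss(est[p], ref[c]) for c, p in enumerate(perm)) / C
        best = loss if best is None else torch.minimum(best, loss)
    return best  # differentiable; gradients flow through the best pairing
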
The MERL/SRI system for the 3RD CHiME challenge using beamforming, robust feature extraction, and advanced speech recognition
This paper introduces the MERL/SRI system designed for the 3rd CHiME speech separation and recognition challenge (CHiME-3). Our proposed system takes advantage of recurrent neural networks (RNNs) [...]
Describing Multimedia Content Using Attention-Based Encoder-Decoder Networks
TLDR
This paper describes systems that learn to attend to different places in the input, for each element of the output, for a variety of tasks: machine translation, image caption generation, video clip description, and speech recognition.
Long short-term memory recurrent neural network architectures for large scale acoustic modeling
TLDR
The first distributed training of LSTM RNNs using asynchronous stochastic gradient descent optimization on a large cluster of machines is introduced, and it is shown that a two-layer deep LSTM RNN where each LSTM layer has a linear recurrent projection layer can exceed state-of-the-art speech recognition performance.
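
The recurrent projection layer mentioned here (LSTMP) shrinks the recurrent state to cut parameter count; modern toolkits expose it directly, e.g. PyTorch's proj_size argument, as in this small example.

import torch

# Two stacked LSTM layers, each with a linear recurrent projection (LSTMP).
lstmp = torch.nn.LSTM(input_size=40, hidden_size=1024, proj_size=512,
                      num_layers=2, batch_first=True)
x = torch.randn(8, 100, 40)  # (batch, frames, features)
y, _ = lstmp(x)              # y: (8, 100, 512), the projected outputs
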
Speech enhancement based on deep denoising autoencoder
TLDR
Experimental results show that adding depth to the DAE consistently increases performance when a large training data set is given, and compared with a minimum mean square error based speech enhancement algorithm, the proposed denoising DAE provides superior performance on the three objective evaluations.
Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition
TLDR
A pre-trained deep neural network hidden Markov model (DNN-HMM) hybrid architecture that trains the DNN to produce a distribution over senones (tied triphone states) as its output, which can significantly outperform conventional context-dependent Gaussian mixture model (GMM) HMMs.
Speech enhancement and recognition using multi-task learning of long short-term memory recurrent neural networks
TLDR
Several integration architectures are proposed and tested, including a pipeline architecture of LSTM-based SE and ASR with sequence training, an alternating estimation architecture, and a multi-task hybrid LSTM network architecture.
Acoustic modelling with CD-CTC-sMBR LSTM RNNs
TLDR
This paper describes a series of experiments to extend the application of context-dependent long short-term memory recurrent neural networks (RNNs) trained with connectionist temporal classification (CTC) and sMBR loss, and investigates transferring knowledge from one network to another through alignments.