Corpus ID: 235652403

Online Self-Attentive Gated RNNs for Real-Time Speaker Separation

@article{Kabeli2021OnlineSG,
  title={Online Self-Attentive Gated RNNs for Real-Time Speaker Separation},
  author={Ori Kabeli and Yossi Adi and Zhenyu Tang and Buye Xu and Anurag Kumar},
  journal={ArXiv},
  year={2021},
  volume={abs/2106.13493}
}
Deep neural networks have recently shown great success in the task of blind source separation, both under monaural and binaural settings. Although these methods were shown to produce high-quality separations, they were mainly applied under offline settings, in which the model has access to the full input signal while separating the signal. In this study, we convert a non-causal state-of-the-art separation model into a causal and real-time model and evaluate its performance under both online and… 
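As a rough illustration of what the conversion to a causal, real-time model involves, the sketch below (not the authors' code) uses a unidirectional recurrent separator whose hidden state is carried across small input chunks, so separated audio can be emitted as the stream arrives. Layer sizes, chunk length, and the absence of overlap between chunks are simplifying assumptions.

import torch
import torch.nn as nn

class CausalSeparator(nn.Module):
    """Toy streaming two-speaker separator (illustrative only)."""
    def __init__(self, n_filters=64, kernel=16, stride=8, hidden=128, n_src=2):
        super().__init__()
        self.encoder = nn.Conv1d(1, n_filters, kernel, stride=stride)       # learned analysis filterbank
        self.rnn = nn.LSTM(n_filters, hidden, batch_first=True)             # unidirectional -> causal
        self.mask = nn.Linear(hidden, n_filters * n_src)
        self.decoder = nn.ConvTranspose1d(n_filters, 1, kernel, stride=stride)
        self.n_src = n_src
        self.n_filters = n_filters

    def forward(self, chunk, state=None):
        # chunk: (batch, 1, samples) - one small block of the incoming stream
        feats = self.encoder(chunk)                        # (B, F, T)
        out, state = self.rnn(feats.transpose(1, 2), state)
        masks = torch.sigmoid(self.mask(out))              # (B, T, F * n_src)
        masks = masks.view(out.size(0), out.size(1), self.n_src, self.n_filters)
        srcs = []
        for s in range(self.n_src):
            masked = (feats.transpose(1, 2) * masks[:, :, s]).transpose(1, 2)
            srcs.append(self.decoder(masked))
        return torch.stack(srcs, dim=1), state             # state is reused on the next chunk

# Streaming usage: feed fixed-size chunks and carry the RNN state between calls.
model = CausalSeparator()
state = None
for chunk in torch.randn(10, 1, 1, 800):                   # 10 chunks of 800 samples (50 ms at 16 kHz)
    separated, state = model(chunk, state)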

Citations

TPARN: Triple-path Attentive Recurrent Network for Time-domain Multichannel Speech Enhancement
TLDR: Experimental results demonstrate the superiority of TPARN over existing state-of-the-art approaches for multichannel speech enhancement in the time domain.

References

Showing 1-10 of 33 references
SAGRNN: Self-Attentive Gated RNN For Binaural Speaker Separation With Interaural Cue Preservation
TLDR: This study extends a newly developed gated recurrent neural network for monaural separation by additionally incorporating self-attention mechanisms and dense connectivity, and develops an end-to-end multiple-input multiple-output system that directly maps from the binaural waveforms of the mixture to those of the individual speech signals.
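The building block this summary describes, a recurrent layer whose output is refined by multi-head self-attention, can be sketched roughly as follows. This is an illustration, not the SAGRNN code; the layer widths, the single-block depth, and the residual connection standing in for the dense connectivity mentioned above are assumptions.

import torch
import torch.nn as nn

class SelfAttentiveRNNBlock(nn.Module):
    """LSTM features refined by multi-head self-attention, with a residual path."""
    def __init__(self, dim=64, hidden=128, heads=4):
        super().__init__()
        self.rnn = nn.LSTM(dim, hidden, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, dim)          # project back to the input width
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):
        # x: (batch, time, dim)
        h, _ = self.rnn(x)
        h = self.proj(h)
        a, _ = self.attn(h, h, h)                       # every frame attends to the whole sequence
        return self.norm(x + a)                         # residual path keeps earlier features

x = torch.randn(2, 100, 64)
y = SelfAttentiveRNNBlock()(x)                          # (2, 100, 64)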
TasNet: Time-Domain Audio Separation Network for Real-Time, Single-Channel Speech Separation
  • Yi Luo, N. Mesgarani
  • Computer Science
    2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2018
TLDR: Time-domain Audio Separation Network (TasNet) is proposed, which outperforms the current state-of-the-art causal and noncausal speech separation algorithms, reduces the computational cost of speech separation, and significantly reduces the minimum required latency of the output.
Conv-TasNet: Surpassing Ideal Time–Frequency Magnitude Masking for Speech Separation
  • Yi Luo, N. Mesgarani
  • Computer Science
    IEEE/ACM Transactions on Audio, Speech, and Language Processing
  • 2019
TLDR: A fully convolutional time-domain audio separation network (Conv-TasNet) is proposed, a deep learning framework for end-to-end time-domain speech separation that significantly outperforms previous time-frequency masking methods in separating two- and three-speaker mixtures.
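A hedged sketch of the kind of layer a fully convolutional, time-domain separator stacks: a dilated, depthwise-separable 1-D convolution with left-only padding so the block can also run causally. Channel counts, kernel size, and the dilation schedule are illustrative and do not reproduce the Conv-TasNet configuration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalDilatedConvBlock(nn.Module):
    """One dilated, depthwise-separable conv block with a residual connection."""
    def __init__(self, channels=128, hidden=256, kernel=3, dilation=1):
        super().__init__()
        self.pad = (kernel - 1) * dilation                 # pad on the left only -> causal
        self.pointwise_in = nn.Conv1d(channels, hidden, 1)
        self.depthwise = nn.Conv1d(hidden, hidden, kernel, dilation=dilation, groups=hidden)
        self.pointwise_out = nn.Conv1d(hidden, channels, 1)
        self.act = nn.PReLU()

    def forward(self, x):
        # x: (batch, channels, time)
        y = self.act(self.pointwise_in(x))
        y = F.pad(y, (self.pad, 0))                        # only past samples are visible
        y = self.act(self.depthwise(y))
        return x + self.pointwise_out(y)                   # residual path

x = torch.randn(1, 128, 500)
blocks = nn.Sequential(*[CausalDilatedConvBlock(dilation=2 ** i) for i in range(4)])
print(blocks(x).shape)   # torch.Size([1, 128, 500]); the receptive field grows with the dilations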
Wave-U-Net: A Multi-Scale Neural Network for End-to-End Audio Source Separation
TLDR: The Wave-U-Net is proposed, an adaptation of the U-Net to the one-dimensional time domain that repeatedly resamples feature maps to compute and combine features at different time scales; experiments indicate that the architecture yields performance comparable to a state-of-the-art spectrogram-based U-Net architecture given the same data.
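A compact sketch of the resampling idea described above: 1-D convolutions interleaved with downsampling on the way in and upsampling plus skip connections on the way out. The depth, channel widths, and the use of simple linear interpolation for resampling are assumptions rather than the Wave-U-Net specifics.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyWaveUNet(nn.Module):
    """Two-level 1-D U-Net: downsample to see longer context, upsample and reuse skips."""
    def __init__(self, ch=24):
        super().__init__()
        self.down1 = nn.Conv1d(1, ch, 15, padding=7)
        self.down2 = nn.Conv1d(ch, 2 * ch, 15, padding=7)
        self.bottleneck = nn.Conv1d(2 * ch, 2 * ch, 15, padding=7)
        self.up2 = nn.Conv1d(4 * ch, ch, 5, padding=2)       # input: upsampled features + skip, concatenated
        self.up1 = nn.Conv1d(2 * ch, 1, 5, padding=2)

    def forward(self, x):
        # x: (batch, 1, samples); samples divisible by 4 for this toy example
        s1 = torch.relu(self.down1(x))
        s2 = torch.relu(self.down2(s1[:, :, ::2]))            # decimate by 2 (coarser time scale)
        b = torch.relu(self.bottleneck(s2[:, :, ::2]))
        u2 = F.interpolate(b, scale_factor=2, mode="linear")  # back up one time scale
        u2 = torch.relu(self.up2(torch.cat([u2, s2], dim=1)))
        u1 = F.interpolate(u2, scale_factor=2, mode="linear")
        return self.up1(torch.cat([u1, s1], dim=1))           # one estimated source waveform

out = TinyWaveUNet()(torch.randn(1, 1, 1024))
print(out.shape)   # torch.Size([1, 1, 1024])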
An End-to-end Architecture of Online Multi-channel Speech Separation
TLDR: Experimental results show that the proposed system achieves comparable performance in an offline evaluation with the original separate processing-based pipeline, while producing remarkable improvements in an online evaluation.
Wavesplit: End-to-End Speech Separation by Speaker Clustering
TLDR: Wavesplit redefines the state of the art on clean mixtures of 2 or 3 speakers, as well as in noisy and reverberated settings, and sets a new benchmark on the recent LibriMix dataset.
All-neural Online Source Separation, Counting, and Diarization for Meeting Analysis
TLDR: This paper presents for the first time an all-neural approach to simultaneous speaker counting, diarization, and source separation, using an NN-based estimator that operates in a block-online fashion and tracks speakers even if they remain silent for a number of time blocks, thus learning a stable output order for the separated sources.
Deep attractor network for single-microphone speaker separation
  • Zhuo Chen, Yi Luo, N. Mesgarani
  • Computer Science
    2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2017
TLDR: A novel deep learning framework for single-channel speech separation is proposed that creates attractor points in a high-dimensional embedding space of the acoustic signals, which pull together the time-frequency bins corresponding to each source.
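The attractor mechanism can be illustrated in a few lines: each time-frequency bin receives an embedding, an attractor per source is formed as the assignment-weighted mean of those embeddings, and soft masks follow from the similarity of each bin to each attractor. The tensor shapes and the use of oracle assignments (as during training) are assumptions made for the sketch.

import torch

D, TF, S = 20, 5000, 2                      # embedding size, number of T-F bins, number of sources

V = torch.randn(TF, D)                      # embeddings of the T-F bins (output of some network)
Y = torch.zeros(TF, S)                      # oracle source assignments, one-hot per bin (training time)
Y[torch.arange(TF), torch.randint(0, S, (TF,))] = 1.0

# Attractor of each source = mean embedding of the bins assigned to it.
attractors = (Y.t() @ V) / Y.sum(dim=0, keepdim=True).t().clamp(min=1.0)   # (S, D)

# Bins are "pulled" toward the closest attractor: similarity -> soft mask per source.
masks = torch.softmax(V @ attractors.t(), dim=1)                           # (TF, S), rows sum to 1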
WHAM!: Extending Speech Separation to Noisy Environments
TLDR: The WSJ0 Hipster Ambient Mixtures dataset is created, consisting of two-speaker mixtures from the wsj0-2mix dataset combined with real ambient noise samples, to benchmark various speech separation architectures and objective functions and evaluate their robustness to noise.
End-To-End Source Separation With Adaptive Front-Ends
TLDR: An auto-encoder neural network is developed that can act as an equivalent to short-time front-end transforms, demonstrating the network's ability to learn optimal, real-valued basis functions directly from the raw waveform of a signal.
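The learned front end described here amounts to a small convolutional auto-encoder over raw samples: an analysis convolution stands in for the short-time transform, a transposed convolution resynthesizes the waveform, and both are fit with a reconstruction loss. The filter count, window length, and stride below are illustrative assumptions.

import torch
import torch.nn as nn

win, hop, n_basis = 32, 16, 128
encoder = nn.Conv1d(1, n_basis, win, stride=hop, bias=False)           # learned "analysis" transform
decoder = nn.ConvTranspose1d(n_basis, 1, win, stride=hop, bias=False)  # learned "synthesis" transform
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

x = torch.randn(8, 1, 4096)                    # a batch of raw waveforms (stand-in for real audio)
for _ in range(10):                            # a few reconstruction steps
    coeffs = torch.relu(encoder(x))            # non-negative, spectrogram-like representation
    x_hat = decoder(coeffs)
    loss = nn.functional.mse_loss(x_hat, x)    # drive the encoder/decoder pair toward reconstruction
    opt.zero_grad()
    loss.backward()
    opt.step()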