Graph-PIT: Generalized permutation invariant training for continuous separation of arbitrary numbers of speakers

@article{vonNeumann2021GraphPITGP,
  title={Graph-PIT: Generalized permutation invariant training for continuous separation of arbitrary numbers of speakers},
  author={Thilo von Neumann and Keisuke Kinoshita and Christoph Boeddeker and Marc Delcroix and Reinhold Haeb-Umbach},
  journal={ArXiv},
  year={2021},
  volume={abs/2107.14446}
}
Automatic transcription of meetings requires handling of overlapped speech, which calls for continuous speech separation (CSS) systems. The uPIT criterion was proposed for utterance-level separation with neural networks and introduces the constraint that the total number of speakers must not exceed the number of output channels. When processing meeting-like data in a segment-wise manner, i.e., by separating overlapping segments independently and stitching adjacent segments to continuous output…
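To make the constraint concrete, here is a small illustrative sketch (not the authors' implementation; all names are hypothetical) contrasting the uPIT requirement that every speaker gets a dedicated output channel with the Graph-PIT view, in which utterances only need to land on different channels when they overlap in time:

```python
# Hypothetical sketch (names illustrative, not the authors' code): under uPIT,
# each speaker needs its own output channel, so the number of speakers must not
# exceed num_channels. Graph-PIT only requires that utterances which overlap
# in time are placed on different channels.
from itertools import product

def overlaps(a, b):
    """True if utterance intervals a = (start, end) and b = (start, end) overlap."""
    return a[0] < b[1] and b[0] < a[1]

def valid_assignments(intervals, num_channels):
    """Yield utterance-to-channel assignments in which overlapping utterances
    never share a channel. uPIT would need len(intervals) <= num_channels;
    here it suffices that at most num_channels utterances are active at once."""
    n = len(intervals)
    for assignment in product(range(num_channels), repeat=n):
        if all(not (assignment[i] == assignment[j] and overlaps(intervals[i], intervals[j]))
               for i in range(n) for j in range(i + 1, n)):
            yield assignment

# Four utterances on two output channels: impossible under uPIT (4 > 2), but
# fine here because at most two utterances are active at any time.
intervals = [(0, 4), (2, 6), (7, 10), (9, 12)]
for a in valid_assignments(intervals, num_channels=2):
    print(a)  # e.g. (0, 1, 0, 1): each overlapping pair is split across channels
```

Graph-PIT then takes the minimum of the reconstruction loss over these valid assignments, analogously to how uPIT takes the minimum over all permutations.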

Citations

Speeding Up Permutation Invariant Training for Source Separation
TLDR
This work presents a decomposition of the PIT criterion into the computation of a matrix of pairwise losses and a strictly monotonically increasing function, so that the permutation (assignment) problem can be solved efficiently with several search algorithms; a sketch of the idea follows below.
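The decomposition described above can be illustrated with an off-the-shelf assignment solver. The following sketch (illustrative only, not the paper's implementation) uses SciPy's Hungarian-algorithm routine to replace the brute-force search over all K! permutations:

```python
# Illustrative only: if the PIT loss decomposes into pairwise terms, it can be
# computed by filling a K x K loss matrix once and solving an assignment
# problem over it, instead of evaluating all K! permutations.
import numpy as np
from scipy.optimize import linear_sum_assignment

def pit_loss_fast(estimates, targets, pairwise_loss):
    """estimates, targets: arrays of shape (K, T); pairwise_loss(e, t) -> scalar."""
    K = len(estimates)
    # Matrix of losses between every estimate/target pair.
    cost = np.array([[pairwise_loss(estimates[i], targets[j]) for j in range(K)]
                     for i in range(K)])
    # The Hungarian algorithm finds the minimizing permutation in O(K^3)
    # instead of brute-force O(K!).
    rows, cols = linear_sum_assignment(cost)
    return cost[rows, cols].mean(), cols

mse = lambda e, t: float(np.mean((e - t) ** 2))
targets = np.random.randn(3, 16000)
estimates = targets[[2, 0, 1]]  # the "network outputs" are permuted targets
loss, perm = pit_loss_fast(estimates, targets, mse)
print(loss, perm)  # ~0.0, [2 0 1]: the permutation is recovered
```

The key requirement is that the total loss is a (strictly monotonically increasing function of a) sum of pairwise terms, so minimizing the sum of selected matrix entries is equivalent to minimizing the original criterion.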
VarArray: Array-Geometry-Agnostic Continuous Speech Separation
TLDR
VarArray is proposed, an array-geometry-agnostic speech separation neural network model that is applicable to any number of microphones without retraining, while leveraging the nonlinear correlation between the input channels.
SA-SDR: A novel loss function for separation of meeting style data
TLDR
This work proposes switching from a mean over the SDRs of the individual output channels to a global SDR computed over all output channels at once, called source-aggregated SDR (SA-SDR). This makes the loss robust against silence and perfect reconstruction as long as at least one reference signal is not silent; a sketch of the distinction follows below.
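The distinction the summary draws can be written down directly. A minimal sketch, assuming plain (non-scale-invariant) SDR and omitting the paper's implementation details:

```python
# Sketch of per-channel-mean SDR vs. source-aggregated SDR (plain SDR,
# illustrative shapes and names).
import numpy as np

def sdr(ref, est, eps=1e-8):
    return 10 * np.log10(np.sum(ref**2) / (np.sum((ref - est)**2) + eps) + eps)

def mean_sdr(refs, ests):
    # Mean over per-channel SDRs: a silent reference channel (all zeros)
    # drives its own SDR, and hence the mean, toward -inf.
    return np.mean([sdr(r, e) for r, e in zip(refs, ests)])

def sa_sdr(refs, ests, eps=1e-8):
    # Source-aggregated SDR: signal and error energies are pooled over all
    # channels before taking the ratio, so silent channels are harmless as
    # long as at least one reference is active.
    num = sum(np.sum(r**2) for r in refs)
    den = sum(np.sum((r - e)**2) for r, e in zip(refs, ests))
    return 10 * np.log10(num / (den + eps) + eps)

refs = [np.random.randn(16000), np.zeros(16000)]   # second channel is silence
ests = [refs[0] + 0.1 * np.random.randn(16000), 0.01 * np.random.randn(16000)]
print(mean_sdr(refs, ests))  # dominated by the silent channel
print(sa_sdr(refs, ests))    # stays well-behaved
```

With a silent reference channel, the per-channel mean is dragged toward minus infinity, while the aggregated ratio stays finite, which is exactly the robustness property described above.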
Separating Long-Form Speech with Group-Wise Permutation Invariant Training
  • Wangyou Zhang, Zhuo Chen, +8 authors Furu Wei
  • Engineering, Computer Science
  • ArXiv
  • 2021
TLDR
A novel training scheme named Group-PIT is proposed, which allows direct training of speech separation models on long-form speech with a low computational cost for label assignment, and demonstrates the effectiveness of the proposed approaches, especially for very long speech inputs.

References

Showing 1-10 of 25 references
Recognizing Overlapped Speech in Meetings: A Multichannel Separation Approach Using Neural Networks
TLDR
This is the first report that applies overlapped speech recognition to unconstrained real meeting audio; the proposed system outperforms one based on a state-of-the-art neural mask-based beamformer by 10.8%.
All-neural Online Source Separation, Counting, and Diarization for Meeting Analysis
TLDR
This paper presents, for the first time, an all-neural approach to simultaneous speaker counting, diarization, and source separation, using an NN-based estimator that operates in a block-online fashion and tracks speakers even if they remain silent for a number of time blocks, thus learning a stable output order for the separated sources.
Deep clustering: Discriminative embeddings for segmentation and separation
TLDR
Preliminary experiments on single-channel mixtures from multiple speakers show that a speaker-independent model trained on two-speaker mixtures can improve signal quality for mixtures of held-out speakers by an average of 6 dB, and that the same model does surprisingly well on three-speaker mixtures.
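The deep clustering objective underlying these results is commonly written as ||V V^T - Y Y^T||_F^2 for unit-norm embeddings V and one-hot speaker labels Y. A minimal sketch using the standard low-rank expansion (shapes and names here are illustrative):

```python
# Sketch of the standard deep clustering objective ||V V^T - Y Y^T||_F^2.
# The expansion below avoids forming the N x N affinity matrices explicitly.
import numpy as np

def dc_loss(V, Y):
    """V: (N, D) unit-norm embeddings, one per time-frequency bin;
    Y: (N, K) one-hot speaker membership per bin."""
    return (np.sum((V.T @ V) ** 2)
            - 2 * np.sum((V.T @ Y) ** 2)
            + np.sum((Y.T @ Y) ** 2))

V = np.random.randn(100, 20)
V /= np.linalg.norm(V, axis=1, keepdims=True)
Y = np.eye(2)[np.random.randint(0, 2, size=100)]  # random 2-speaker labels
print(dc_loss(V, Y))
```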
Permutation invariant training of deep models for speaker-independent multi-talker speech separation
TLDR
This work proposes a novel deep learning training criterion, named permutation invariant training (PIT), for speaker-independent multi-talker speech separation, and finds that it compares favorably to non-negative matrix factorization (NMF), computational auditory scene analysis (CASA), and DPCL, and generalizes well over unseen speakers and languages. A minimal sketch of the criterion follows below.
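For reference, the PIT criterion as summarized here is simply a minimum over all pairings of outputs and targets. A minimal brute-force sketch (illustrative, not the paper's code):

```python
# Brute-force illustration of the PIT criterion: the training loss is the
# minimum, over all K! output/target pairings, of the mean pairwise loss.
from itertools import permutations
import numpy as np

def pit_loss(estimates, targets, pairwise_loss):
    K = len(targets)
    return min(
        np.mean([pairwise_loss(estimates[i], targets[p[i]]) for i in range(K)])
        for p in permutations(range(K))
    )

mse = lambda e, t: float(np.mean((e - t) ** 2))
targets = np.random.randn(2, 8000)
print(pit_loss(targets[::-1], targets, mse))  # 0.0: the swapped order is recovered
```

The O(K!) search shown here is what the "Speeding Up Permutation Invariant Training" work above replaces with an efficient assignment solver.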
Multitalker Speech Separation With Utterance-Level Permutation Invariant Training of Deep Recurrent Neural Networks
In this paper, we propose the utterance-level permutation invariant training (uPIT) technique. uPIT is a practically applicable, end-to-end, deep-learning-based solution for speaker-independent multi-talker speech separation.
Continuous Speech Separation: Dataset and Analysis
  • Zhuo Chen, T. Yoshioka, +5 authors Jinyu Li
  • Computer Science, Engineering
  • ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2020
TLDR
A new real recording dataset, called LibriCSS, is derived from LibriSpeech by concatenating the corpus utterances to simulate conversations and capturing the audio replays with far-field microphones; it helps researchers develop systems that can be readily applied to real scenarios.
Conv-TasNet: Surpassing Ideal Time–Frequency Magnitude Masking for Speech Separation
  • Yi Luo, N. Mesgarani
  • Computer Science, Medicine
  • IEEE/ACM Transactions on Audio, Speech, and Language Processing
  • 2019
TLDR
Conv-TasNet, a fully convolutional time-domain audio separation network for end-to-end speech separation, is proposed; it significantly outperforms previous time-frequency masking methods in separating two- and three-speaker mixtures.
TaSNet: Time-Domain Audio Separation Network for Real-Time, Single-Channel Speech Separation
  • Yi Luo, N. Mesgarani
  • Computer Science, Engineering
  • 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2018
TLDR
The Time-domain Audio Separation Network (TasNet) is proposed, which outperforms the current state-of-the-art causal and non-causal speech separation algorithms, reduces the computational cost of speech separation, and significantly reduces the minimum required latency of the output.
The fifth 'CHiME' Speech Separation and Recognition Challenge: Dataset, task and baselines
TLDR
The 5th CHiME Challenge is introduced, which considers the task of distant multi-microphone conversational ASR in real home environments; the data collection procedure, the task, and the baseline systems for array synchronization, speech enhancement, and conventional and end-to-end ASR are described.
Simultaneous Speech Recognition and Speaker Diarization for Monaural Dialogue Recordings with Target-Speaker Acoustic Models
TLDR
This paper investigates the use of target-speaker automatic speech recognition (TS-ASR) for simultaneous speech recognition and speaker diarization of single-channel dialogue recordings, and proposes an iterative method in which the estimation of speaker embeddings and TS-ASR based on the estimated embeddings are executed alternately.