Graph-PIT: Generalized permutation invariant training for continuous separation of arbitrary numbers of speakers

Thilo von Neumann, Keisuke Kinoshita, Christoph Boeddeker, Marc Delcroix, Reinhold Haeb-Umbach
Automatic transcription of meetings requires handling of overlapped speech, which calls for continuous speech separation (CSS) systems. The uPIT criterion was proposed for utterance-level separation with neural networks and introduces the constraint that the total number of speakers must not exceed the number of output channels. When processing meeting-like data in a segment-wise manner, i.e., by separating overlapping segments independently and stitching adjacent segments to continuous output…


Speeding Up Permutation Invariant Training for Source Separation
This work presents a decomposition of the PIT criterion into the computation of a matrix and a strictly monotonically increasing function, so that the permutation (assignment) problem can be solved efficiently with several search algorithms.
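The decomposition summarized above can be sketched in a few lines: compute the C × C matrix of pairwise channel losses once, then solve the assignment problem (e.g., with the Hungarian algorithm) instead of enumerating all C! permutations. The function name and the choice of MSE below are illustrative assumptions, not the paper's exact formulation, and SciPy is assumed to be available.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment  # Hungarian algorithm

def pit_via_assignment(estimates, targets):
    """Hypothetical sketch: PIT via the assignment problem.

    Builds the C x C matrix of pairwise channel losses (here MSE) and
    lets linear_sum_assignment pick the best estimate-to-target mapping
    in O(C^3), instead of enumerating all C! permutations.  This works
    because the total loss is a strictly monotonically increasing
    function (here the mean) of the selected matrix entries.
    """
    C = estimates.shape[0]
    cost = np.array([[np.mean((estimates[i] - targets[j]) ** 2)
                      for j in range(C)] for i in range(C)])
    rows, cols = linear_sum_assignment(cost)
    return cost[rows, cols].mean(), tuple(cols)

# Toy check: estimates are the references with channels swapped.
targets = np.stack([np.ones(8), -np.ones(8)])
estimates = targets[::-1].copy()
loss, perm = pit_via_assignment(estimates, targets)
```

The assignment solver finds the channel mapping (1, 0) and a loss of zero here, the same result a brute-force search over permutations would return.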
Separating Long-Form Speech with Group-Wise Permutation Invariant Training
A novel training scheme named Group-PIT is proposed, which allows direct training of the speech separation models on the long-form speech with a low computational cost for label assignment and demonstrates the effectiveness of the proposed approaches, especially in dealing with a very long speech input.
Multi-turn RNN-T for streaming recognition of multi-party speech
This work proposes a novel multi-turn RNN-T (MT-RNN-T) model with an overlap-based target arrangement strategy that generalizes to an arbitrary number of speakers without changes to the model architecture, and addresses several challenges in previous work on the multi-speaker recurrent neural network transducer (MS-RNN-T).
VarArray: Array-Geometry-Agnostic Continuous Speech Separation
VarArray is proposed, an array-geometry-agnostic speech separation neural network model that is applicable to any number of microphones without retraining while leveraging the nonlinear correlation between the input channels.
SA-SDR: A novel loss function for separation of meeting style data
This work proposes to switch from a mean over the SDRs of the individual output channels to a global SDR computed over all output channels at once, called source-aggregated SDR (SA-SDR), which makes the loss robust against silence and perfect reconstruction as long as at least one reference signal is not silent.
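Following the description above, SA-SDR can be sketched as one energy ratio aggregated over all channels rather than a mean of per-channel SDRs; the epsilon handling below is an assumption for numerical safety, not taken from the paper.

```python
import numpy as np

def sa_sdr(references, estimates, eps=1e-8):
    """Sketch of source-aggregated SDR: one ratio of total reference
    energy to total error energy, summed across all output channels.
    A silent reference channel no longer drives the loss to -inf,
    as long as at least one reference is non-silent."""
    signal = np.sum(references ** 2)
    error = np.sum((references - estimates) ** 2)
    return 10.0 * np.log10((signal + eps) / (error + eps))

# One active speaker, one silent channel, perfect reconstruction:
# the aggregated score stays finite and large instead of -inf.
refs = np.array([[1.0, 1.0, 1.0], [0.0, 0.0, 0.0]])
score = sa_sdr(refs, refs.copy())
```

A per-channel mean SDR would be undefined for the silent second channel; aggregating the energies first is what makes the criterion usable on meeting-style segments where some output channels are inactive.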
SkiM: Skipping Memory LSTM for Low-Latency Real-Time Continuous Speech Separation
  • Chenda Li, Lei Yang, Weiqin Wang, Yanmin Qian
  • Engineering, Computer Science
  • 2022
Continuous speech separation for meeting pre-processing has recently become an active research topic. Compared to the data in utterance-level speech separation, the meeting-style audio stream lasts…


Recognizing Overlapped Speech in Meetings: A Multichannel Separation Approach Using Neural Networks
This is the first report that applies overlapped speech recognition to unconstrained real meeting audio; the proposed system outperforms one based on a state-of-the-art neural mask-based beamformer by 10.8%.
All-neural Online Source Separation, Counting, and Diarization for Meeting Analysis
This paper presents for the first time an all-neural approach to simultaneous speaker counting, diarization and source separation, using an NN-based estimator that operates in a block-online fashion and tracks speakers even if they remain silent for a number of time blocks, thus learning a stable output order for the separated sources.
Deep clustering: Discriminative embeddings for segmentation and separation
Preliminary experiments on single-channel mixtures from multiple speakers show that a speaker-independent model trained on two-speaker mixtures can improve signal quality for mixtures of held-out speakers by an average of 6 dB, and the same model does surprisingly well on three-speaker mixtures.
Permutation invariant training of deep models for speaker-independent multi-talker speech separation
This work proposes a novel deep learning training criterion, named permutation invariant training (PIT), for speaker-independent multi-talker speech separation, and finds that it compares favorably to non-negative matrix factorization (NMF), computational auditory scene analysis (CASA), and DPCL, and generalizes well over unseen speakers and languages.
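As a minimal illustration of the PIT idea described above (with MSE standing in for the actual training loss), the criterion evaluates every assignment of output channels to reference speakers and trains on the best one:

```python
import itertools
import numpy as np

def pit_mse(estimates, targets):
    """Brute-force PIT: try every permutation of the estimated
    channels against the references and keep the lowest loss."""
    C = estimates.shape[0]
    best_loss, best_perm = np.inf, None
    for perm in itertools.permutations(range(C)):
        loss = np.mean((estimates[list(perm)] - targets) ** 2)
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm

# Two-speaker toy example: the model emits the speakers in swapped order.
targets = np.stack([np.ones(8), -np.ones(8)])
estimates = targets[::-1].copy()
loss, perm = pit_mse(estimates, targets)  # permutation (1, 0) matches exactly
```

Enumerating all C! permutations is only feasible for small speaker counts, which is precisely the cost that later assignment-based formulations reduce.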
Multitalker Speech Separation With Utterance-Level Permutation Invariant Training of Deep Recurrent Neural Networks
In this paper, we propose the utterance-level permutation invariant training (uPIT) technique. uPIT is a practically applicable, end-to-end, deep-learning-based solution for speaker-independent multi-talker speech separation…
Continuous Speech Separation: Dataset and Analysis
  • Zhuo Chen, T. Yoshioka, +5 authors Jinyu Li
  • Computer Science, Engineering
    ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2020
A new real-recording dataset, called LibriCSS, is derived from LibriSpeech by concatenating the corpus utterances to simulate conversations and capturing the audio replays with far-field microphones, which helps researchers develop systems that can be readily applied to real scenarios.
Conv-TasNet: Surpassing Ideal Time–Frequency Magnitude Masking for Speech Separation
  • Yi Luo, N. Mesgarani
  • Computer Science, Medicine
    IEEE/ACM Transactions on Audio, Speech, and Language Processing
  • 2019
A fully convolutional time-domain audio separation network (Conv-TasNet), a deep learning framework for end-to-end time-domain speech separation, which significantly outperforms previous time–frequency masking methods in separating two- and three-speaker mixtures.
TasNet: Time-Domain Audio Separation Network for Real-Time, Single-Channel Speech Separation
  • Yi Luo, N. Mesgarani
  • Computer Science, Engineering
    2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2018
Time-domain Audio Separation Network (TasNet) is proposed, which outperforms the current state-of-the-art causal and noncausal speech separation algorithms, reduces the computational cost of speech separation, and significantly reduces the minimum required latency of the output.
The fifth 'CHiME' Speech Separation and Recognition Challenge: Dataset, task and baselines
The 5th CHiME Challenge is introduced, which considers the task of distant multi-microphone conversational ASR in real home environments and describes the data collection procedure, the task, and the baseline systems for array synchronization, speech enhancement, and conventional and end-to-end ASR.
Simultaneous Speech Recognition and Speaker Diarization for Monaural Dialogue Recordings with Target-Speaker Acoustic Models
This paper investigates the use of target-speaker automatic speech recognition for simultaneous speech recognition and speaker diarization of single-channel dialogue recordings, and proposes an iterative method in which the estimation of speaker embeddings and TS-ASR based on the estimated speaker embeddings are alternately executed.