Neural Speaker Diarization with Speaker-Wise Chain Rule
@article{Fujita2020NeuralSD, title={Neural Speaker Diarization with Speaker-Wise Chain Rule}, author={Yusuke Fujita and Shinji Watanabe and Shota Horiguchi and Yawen Xue and Jing Shi and Kenji Nagamatsu}, journal={ArXiv}, year={2020}, volume={abs/2006.01796} }
Speaker diarization is an essential step for processing multi-speaker audio. Although an end-to-end neural diarization (EEND) method achieved state-of-the-art performance, it is limited to a fixed number of speakers. In this paper, we address this fixed-number-of-speakers limitation with a novel speaker-wise conditional inference method based on the probabilistic chain rule. In the proposed method, each speaker's speech activity is regarded as a single random variable, and is estimated sequentially…
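As a hedged illustration of the chain-rule idea mentioned in the abstract: the notation below is assumed for this sketch and is not taken from the paper. Let X denote the input audio features and y_s the speech-activity sequence of speaker s, for S speakers. The probabilistic chain rule factorizes the joint distribution so that each speaker's activity is conditioned on the activities that come before it in the sequence:

```latex
% Notation (X, y_s, S) is assumed for illustration; the paper's own symbols may differ.
P(y_1, \dots, y_S \mid X) = \prod_{s=1}^{S} P(y_s \mid y_1, \dots, y_{s-1}, X)
```

Because the factors can be estimated one at a time, the number of estimation steps need not be fixed in advance, which is one way a sequential formulation could accommodate a variable number of speakers.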
24 Citations
EEND-SS: Joint End-to-End Neural Speaker Diarization and Speech Separation for Flexible Number of Speakers
- Computer Science
- ArXiv
- 2022
Experiments show that the proposed method outperforms the baselines in terms of diarization and separation performance for both fixed and flexible numbers of speakers, as well as speaker counting performance for flexible numbers of speakers.
End-To-End Diarization for Variable Number of Speakers with Local-Global Networks and Discriminative Speaker Embeddings
- Computer Science
- ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2021
An end-to-end deep network model that performs meeting diarization from single-channel audio recordings, designed to handle meetings with unknown numbers of speakers, using variable-number permutation-invariant cross-entropy based loss functions.
End-To-End Speaker Diarization as Post-Processing
- Computer Science
- ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2021
This paper proposes to use a two-speaker end-to-end diarization method as post-processing of the results obtained by a clustering-based method, and shows that the proposed algorithm consistently improved the performance of the state-of-the-art methods across CALLHOME, AMI, and DIHARD II datasets.
Online End-to-End Neural Diarization Handling Overlapping Speech and Flexible Numbers of Speakers
- Computer Science
- ArXiv
- 2021
An online end-to-end diarization method that can handle overlapping speech and flexible numbers of speakers is proposed; it achieves performance comparable to the offline EEND method and shows better performance on the DIHARD II dataset.
End-to-End Speaker Diarization Conditioned on Speech Activity and Overlap Detection
- Computer Science
- 2021 IEEE Spoken Language Technology Workshop (SLT)
- 2021
This paper proposes a novel multitask learning framework that solves speaker diarization and a desired subtask while explicitly considering the task dependency, and outperforms conventional EEND systems in terms of diarization error rate.
Robust End-to-end Speaker Diarization with Generic Neural Clustering
- Computer Science
- ArXiv
- 2022
Experiments show that, when integrated with an attractor-based chunk-level predictor, the proposed neural clustering approach can yield a better Diarization Error Rate (DER) than constrained K-means-based clustering approaches under mismatched conditions.
Multi-Speaker ASR Combining Non-Autoregressive Conformer CTC and Conditional Speaker Chain
- Computer Science
- Interspeech
- 2021
This study extends the proposed conditional chain model to NAR multi-speaker ASR, and can even perform better than the PIT-Conformer AR model with only 1/7 of the latency, obtaining WERs of 19.9% and 34.3% on the WSJ0-2mix and WSJ0-3mix sets.
DIVE: End-to-End Speech Diarization Via Iterative Speaker Embedding
- Computer Science
- 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)
- 2021
DIVE presents the diarization task as an iterative process: it repeatedly builds a representation for each speaker before predicting their voice activity conditioned on the extracted representations, which intrinsically resolves the speaker ordering ambiguity without requiring the classical permutation invariant training loss.
Robust End-to-End Speaker Diarization with Conformer and Additive Margin Penalty
- Computer Science
- Interspeech
- 2021
This work aims to make the EEND-EDA model more robust by incorporating an additive margin penalty to minimize intra-class variance and by replacing the Transformer encoders with Conformer encoders to capture local information.
Separation Guided Speaker Diarization in Realistic Mismatched Conditions
- Physics
- ArXiv
- 2021
The proposed SGSD system can significantly improve the performance of state-of-the-art CSD systems, yielding relative diarization error rate reductions of 20.2% and 20.8% on the development set and evaluation set, respectively.
References
End-to-End Neural Speaker Diarization with Permutation-Free Objectives
- Computer Science
- INTERSPEECH
- 2019
Besides its end-to-end simplicity, the proposed method also benefits from being able to explicitly handle overlapping speech during training and inference, and can be easily trained/adapted with real-recorded multi-speaker conversations just by feeding the corresponding multi-speaker segment labels.
Fully Supervised Speaker Diarization
- Computer Science
- ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2019
A fully supervised speaker diarization approach, named unbounded interleaved-state recurrent neural networks (UIS-RNN), is proposed; given extracted speaker-discriminative embeddings, it decodes in an online fashion, while most state-of-the-art systems rely on offline clustering.
Speaker diarization using deep neural network embeddings
- Computer Science
- 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2017
This work proposes an alternative approach for learning representations via deep neural networks to remove the i-vector extraction process from the pipeline entirely and shows that, though this approach does not respond as well to unsupervised calibration strategies as previous systems, the incorporation of well-founded speaker priors sufficiently mitigates this shortcoming.
Speaker Diarization with LSTM
- Computer Science
- 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2018
This work combines LSTM-based d-vector audio embeddings with recent work in nonparametric clustering to obtain a state-of-the-art speaker diarization system that achieves a 12.0% diarization error rate on NIST SRE 2000 CALLHOME, while the model is trained with out-of-domain data from voice search logs.
End-to-End Neural Speaker Diarization with Self-Attention
- Computer Science
- 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)
- 2019
The experimental results revealed that self-attention was the key to achieving good performance, and that the proposed EEND method performed significantly better than the conventional BLSTM-based method and even outperformed the state-of-the-art x-vector clustering-based method.
Speaker diarization with PLDA i-vector scoring and unsupervised calibration
- Computer Science
- 2014 IEEE Spoken Language Technology Workshop (SLT)
- 2014
A system that incorporates probabilistic linear discriminant analysis (PLDA) for i-vector scoring and uses unsupervised calibration of the PLDA scores to determine the clustering stopping criterion is proposed, and it is shown that PLDA scoring outperforms the same system with cosine scoring, and that overlapping segments reduce diarization error rate (DER) as well.
Speaker Diarization with Region Proposal Network
- Computer Science
- ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2020
This paper proposes a novel speaker diarization method: Region Proposal Network based Speaker Diarization (RPNSD), where a neural network generates overlapped speech segment proposals and computes their speaker embeddings at the same time.
Speaker Diarization: A Review of Recent Research
- Computer Science
- IEEE Transactions on Audio, Speech, and Language Processing
- 2012
An analysis of speaker diarization performance, as reported through the NIST Rich Transcription evaluations on meeting data, is presented, and important areas for future research are identified.
All-neural Online Source Separation, Counting, and Diarization for Meeting Analysis
- Computer Science
- ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2019
This paper presents for the first time an all-neural approach to simultaneous speaker counting, diarization and source separation, using an NN-based estimator that operates in a block-online fashion and tracks speakers even if they remain silent for a number of time blocks, thus learning a stable output order for the separated sources.
Speaker Diarization with Enhancing Speech for the First DIHARD Challenge
- Computer Science
- INTERSPEECH
- 2018
This work designs a novel speaker diarization system for the first DIHARD challenge by integrating several important modules of speech denoising, speech activity detection, i-vector design, and scoring strategy, and adopts a residual convolutional neural network trained on a large dataset including more than 30,000 people.