Corpus ID: 219179559

Neural Speaker Diarization with Speaker-Wise Chain Rule

@article{Fujita2020NeuralSD,
  title={Neural Speaker Diarization with Speaker-Wise Chain Rule},
  author={Yusuke Fujita and Shinji Watanabe and Shota Horiguchi and Yawen Xue and Jing Shi and Kenji Nagamatsu},
  journal={ArXiv},
  year={2020},
  volume={abs/2006.01796}
}
Speaker diarization is an essential step for processing multi-speaker audio. Although an end-to-end neural diarization (EEND) method achieved state-of-the-art performance, it is limited to a fixed number of speakers. In this paper, we solve this fixed-number-of-speakers issue with a novel speaker-wise conditional inference method based on the probabilistic chain rule. In the proposed method, each speaker's speech activity is regarded as a single random variable and is estimated sequentially…
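The speaker-wise conditional inference described above factorizes the joint speech activities by the chain rule, p(y_1, …, y_S | X) = ∏_s p(y_s | y_1, …, y_{s-1}, X), decoding one speaker at a time until an empty activity is produced. A minimal sketch of that decoding loop is shown below; it is an illustration, not the paper's implementation, and `speaker_activity_model` is a hypothetical, deterministic stand-in for the trained conditional network.

```python
import numpy as np

def speaker_activity_model(features, previous_activities):
    """Hypothetical stand-in for the trained network that estimates one
    speaker's frame-wise activity probability, conditioned on the audio
    features and the activities of already-decoded speakers.
    Toy rule: speaker k is "active" where feature channel k exceeds 0.5;
    once the channels are exhausted, it returns silence."""
    k = len(previous_activities)
    if k >= features.shape[1]:
        return np.zeros(features.shape[0])
    return (features[:, k] > 0.5).astype(float)

def chain_rule_diarization(features, threshold=0.5, max_speakers=10):
    """Sequentially estimate each speaker's activity, conditioning on the
    speakers decoded so far; stop when the next estimated activity is
    empty (i.e., no further speaker is detected)."""
    activities = []
    for _ in range(max_speakers):
        probs = speaker_activity_model(features, activities)
        decisions = (probs > threshold).astype(int)
        if decisions.sum() == 0:  # stopping criterion: empty activity
            break
        activities.append(decisions)
    return activities

# Toy input: 6 frames x 2 feature channels (one per simulated speaker).
X = np.array([[0.9, 0.0],
              [0.8, 0.0],
              [0.7, 0.6],
              [0.0, 0.9],
              [0.0, 0.8],
              [0.0, 0.0]])
acts = chain_rule_diarization(X)
print(len(acts))  # → 2 speakers found; frame 2 is overlapped speech
```

Because the number of decoding iterations is determined at inference time by the stopping criterion, this scheme is not tied to a fixed speaker count, which is the paper's central point.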

Citations

End-To-End Diarization for Variable Number of Speakers with Local-Global Networks and Discriminative Speaker Embeddings
An end-to-end deep network model that performs meeting diarization from single-channel audio recordings, designed to handle meetings with unknown numbers of speakers, using variable-number permutation-invariant cross-entropy based loss functions.
End-To-End Speaker Diarization as Post-Processing
This paper proposes to use a two-speaker end-to-end diarization method as post-processing of the results obtained by a clustering-based method, and shows that the proposed algorithm consistently improved the performance of the state-of-the-art methods across CALLHOME, AMI, and DIHARD II datasets.
Online End-to-End Neural Diarization Handling Overlapping Speech and Flexible Numbers of Speakers
An online end-to-end diarization method that can handle overlapping speech and flexible numbers of speakers is proposed; it achieves performance comparable to the offline EEND method and performs better on the DIHARD II dataset.
End-to-End Speaker Diarization Conditioned on Speech Activity and Overlap Detection
This paper proposes a novel multitask learning framework that solves speaker diarization and a desired subtask while explicitly considering the task dependency, and outperforms conventional EEND systems in terms of diarization error rate.
DIVE: End-to-end Speech Diarization via Iterative Speaker Embedding
This work introduces DIVE, an end-to-end speaker diarization algorithm that does not rely on pretrained speaker representations and optimizes all parameters of the system with a multi-speaker voice activity loss.
Multi-Speaker ASR Combining Non-Autoregressive Conformer CTC and Conditional Speaker Chain
This study extends the proposed conditional chain model to NAR multi-speaker ASR, which can even outperform the PIT-Conformer AR model with only 1/7 the latency, obtaining WERs of 19.9% and 34.3% on the WSJ0-2mix and WSJ0-3mix sets.
Robust End-to-End Speaker Diarization with Conformer and Additive Margin Penalty
This work aims to improve the robustness of the EEND-EDA model by incorporating an additive margin penalty that minimizes the intra-class variance, and by replacing the Transformer encoders with Conformer encoders to capture local information.
Separation Guided Speaker Diarization in Realistic Mismatched Conditions
The proposed SGSD system can significantly improve the performance of state-of-the-art CSD systems, yielding relative diarization error rate reductions of 20.2% and 20.8% on the development set and evaluation set, respectively.
Encoder-Decoder Based Attractor Calculation for End-to-End Neural Diarization
This paper introduces encoder-decoder-based attractor calculation module (EDA) to EEND and proposes a method that aligns the estimated diarization results with the results of an external speech activity detector, which enables fair comparison against pipeline approaches.
Semi-Supervised Training with Pseudo-Labeling for End-to-End Neural Diarization
This paper proposes an iterative pseudo-label method for EEND, which trains the model using unlabeled data of a target condition, and a committee-based training method to improve the performance of EEND.

References

Showing 1–10 of 40 references
End-to-End Neural Speaker Diarization with Permutation-Free Objectives
Besides its end-to-end simplicity, the proposed method also benefits from being able to explicitly handle overlapping speech during training and inference, and can be easily trained/adapted with real-recorded multi-speaker conversations simply by feeding the corresponding multi-speaker segment labels.
Fully Supervised Speaker Diarization
A fully supervised speaker diarization approach named unbounded interleaved-state recurrent neural networks (UIS-RNN) is proposed; given extracted speaker-discriminative embeddings, it decodes in an online fashion, while most state-of-the-art systems rely on offline clustering.
Speaker diarization using deep neural network embeddings
This work proposes an alternative approach for learning representations via deep neural networks to remove the i-vector extraction process from the pipeline entirely and shows that, though this approach does not respond as well to unsupervised calibration strategies as previous systems, the incorporation of well-founded speaker priors sufficiently mitigates this shortcoming.
Speaker Diarization with LSTM
This work combines LSTM-based d-vector audio embeddings with recent work in nonparametric clustering to obtain a state-of-the-art speaker diarization system that achieves a 12.0% diarization error rate on NIST SRE 2000 CALLHOME, while the model is trained with out-of-domain data from voice search logs.
End-to-End Neural Speaker Diarization with Self-Attention
The experimental results revealed that self-attention was the key to achieving good performance, and that the proposed EEND method performed significantly better than the conventional BLSTM-based method and even better than the state-of-the-art x-vector clustering-based method.
Speaker diarization with plda i-vector scoring and unsupervised calibration
A system that incorporates probabilistic linear discriminant analysis (PLDA) for i-vector scoring and uses unsupervised calibration of the PLDA scores to determine the clustering stopping criterion is proposed, and it is shown that PLDA scoring outperforms the same system with cosine scoring, and that overlapping segments reduce diarization error rate (DER) as well.
Speaker Diarization with Region Proposal Network
  Zili Huang, Shinji Watanabe, +4 authors S. Khudanpur · ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020
This paper proposes a novel speaker diarization method, Region Proposal Network based Speaker Diarization (RPNSD), in which a neural network generates overlapped speech segment proposals and computes their speaker embeddings at the same time.
Speaker Diarization: A Review of Recent Research
An analysis of speaker diarization performance as reported through the NIST Rich Transcription evaluations on meeting data is presented, and important areas for future research are identified.
All-neural Online Source Separation, Counting, and Diarization for Meeting Analysis
This paper presents for the first time an all-neural approach to simultaneous speaker counting, diarization and source separation, using an NN-based estimator that operates in a block-online fashion and tracks speakers even if they remain silent for a number of time blocks, thus learning a stable output order for the separated sources.
Speaker Diarization with Enhancing Speech for the First DIHARD Challenge
This work designs a novel speaker diarization system for the first DIHARD challenge by integrating several important modules for speech denoising, speech activity detection, i-vector design, and scoring strategy, and adopts a residual convolutional neural network trained on a large dataset including more than 30,000 people.