Active Learning Based Constrained Clustering For Speaker Diarization


Most speaker diarization research has focused on unsupervised scenarios, where no human supervision is available. However, in many real-world applications, a certain amount of human input can be expected, especially when minimal human supervision brings a significant performance improvement. In this study, we propose an active learning based bottom-up speaker clustering algorithm to effectively improve speaker diarization performance with limited human input. Specifically, the proposed active learning based speaker clustering has two stages: <italic>explore</italic> and <italic>constrained clustering</italic>. The <italic>explore</italic> stage quickly discovers at least one sample for each speaker, so that the speaker clustering process starts from reliable initial speaker clusters. After all, or a majority, of the involved speakers have been discovered during the <italic>explore</italic> stage, <italic>constrained clustering</italic> is performed. <italic>Constrained clustering</italic> is similar to the traditional bottom-up clustering process, with the important difference that the clusters created during the <italic>explore</italic> stage are restricted from merging with each other. <italic>Constrained clustering</italic> continues until only the clusters generated from the <italic>explore</italic> stage are left. Since the objective of the active learning based speaker clustering algorithm is to provide good initial speaker models, performance saturates as soon as sufficient examples are ensured for each cluster. To further improve diarization performance with increasing human input, we propose a second method that actively selects, for human evaluation and reassignment, the speech segments that account for the largest expected speaker error under the existing cluster assignments. The algorithms are evaluated on our recently created Apollo Mission Control Center dataset as well as the Augmented Multi-party Interaction (AMI) meeting corpus. The results indicate that the proposed active learning algorithms are able to reduce diarization error rate significantly with a relatively small amount of human supervision.
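
The constrained clustering stage described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each speech segment is already represented by a fixed-length feature vector, it uses squared Euclidean distance between cluster centroids as a stand-in for whatever cluster distance the paper uses, and the names `constrained_clustering`, `segments`, and `seeds` are hypothetical. The `seeds` argument stands in for the human-labeled clusters produced by the explore stage.

```python
from itertools import combinations


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]


def constrained_clustering(segments, seeds):
    """Bottom-up clustering in which seed clusters never merge with each other.

    segments: list of feature vectors, one per speech segment (assumed input).
    seeds: dict mapping a speaker label to the segment indices that a human
           labeled during the explore stage (hypothetical interface).
    Returns a dict mapping each speaker label to its sorted segment indices.
    """
    if not seeds:
        raise ValueError("at least one seed cluster is required")

    # Seed clusters carry a speaker label; every other segment starts alone.
    clusters = [{"members": list(idx), "seed": label} for label, idx in seeds.items()]
    seeded = {i for idx in seeds.values() for i in idx}
    clusters += [
        {"members": [i], "seed": None}
        for i in range(len(segments))
        if i not in seeded
    ]

    # Merge until only the seed clusters from the explore stage are left.
    while any(c["seed"] is None for c in clusters):
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            ca, cb = clusters[a], clusters[b]
            if ca["seed"] is not None and cb["seed"] is not None:
                continue  # constraint: two seed clusters may not merge
            d = sq_dist(
                centroid([segments[i] for i in ca["members"]]),
                centroid([segments[i] for i in cb["members"]]),
            )
            if best is None or d < best[0]:
                best = (d, a, b)
        _, a, b = best
        ca, cb = clusters[a], clusters[b]
        merged = {
            "members": ca["members"] + cb["members"],
            "seed": ca["seed"] if ca["seed"] is not None else cb["seed"],
        }
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]

    return {c["seed"]: sorted(c["members"]) for c in clusters}


# Toy usage: six 2-D "segments", with one human-labeled sample per speaker.
toy_segments = [[0.0, 0.0], [0.1, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]]
toy_result = constrained_clustering(toy_segments, {"A": [0], "B": [3]})
```

Because two seed clusters are never allowed to merge, the loop can only stop once every unlabeled segment has been absorbed into one of them, which mirrors the stopping condition stated in the abstract. The sketch does not cover the explore stage itself or the second method that selects segments for human reassignment.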

11 Figures and Tables

Cite this paper

@article{Yu2017ActiveLB, title={Active Learning Based Constrained Clustering For Speaker Diarization}, author={Chengzhu Yu and John H. L. Hansen}, journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing}, year={2017}, volume={25}, pages={2188-2198} }