Spatial-aware Speaker Diarization for Multi-channel Multi-party Meeting
@inproceedings{Wang2022SpatialawareSD,
  title     = {Spatial-aware Speaker Diarization for Multi-channel Multi-party Meeting},
  author    = {Jie Wang and Yuji Liu and Binling Wang and Yiming Zhi and Song Li and Shipeng Xia and Jiayang Zhang and Feng Tong and Lin Li and Qingyang Hong},
  booktitle = {Interspeech},
  year      = {2022}
}
This paper describes a spatial-aware speaker diarization system for multi-channel multi-party meetings. The diarization system obtains the speaker's direction information from a microphone array. A speaker-spatial embedding is generated from the x-vector and the s-vector derived from superdirective beamforming (SDB), which makes the embedding more robust. Specifically, we propose a novel multi-channel sequence-to-sequence neural network architecture named discriminative multi-stream neural network (DMSNet), which…
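The abstract names two spatial ingredients: a superdirective beamformer (SDB) over the microphone array, and an s-vector that is fused with the x-vector into a speaker-spatial embedding. Below is a minimal sketch of those two pieces, assuming a plane-wave far-field model, a spherically diffuse noise field, and simple concatenation as the fusion step; the function names, array geometry, and the concatenation fusion are illustrative assumptions only, since the paper's actual fusion is learned by DMSNet.

```python
# Minimal sketch (assumptions, not the paper's exact pipeline): superdirective
# beamforming weights for a given look direction, plus a naive speaker-spatial
# embedding built by concatenating an x-vector with an s-vector.
import numpy as np

def superdirective_weights(mic_pos, look_dir, freqs, c=343.0, diag_load=1e-2):
    """Per-frequency SDB weights w = Gamma^{-1} d / (d^H Gamma^{-1} d).

    mic_pos  : (M, 3) microphone coordinates in metres
    look_dir : (3,) unit vector toward the estimated speaker direction
    freqs    : (F,) analysis frequencies in Hz
    """
    M = mic_pos.shape[0]
    pair_dist = np.linalg.norm(mic_pos[:, None, :] - mic_pos[None, :, :], axis=-1)
    delays = mic_pos @ look_dir / c                      # plane-wave delays (s)
    weights = np.empty((len(freqs), M), dtype=complex)
    for i, f in enumerate(freqs):
        # Spherically diffuse noise coherence matrix, diagonally loaded for stability.
        gamma = np.sinc(2.0 * f * pair_dist / c) + diag_load * np.eye(M)
        d = np.exp(-2j * np.pi * f * delays)             # steering vector
        g_inv_d = np.linalg.solve(gamma, d)
        weights[i] = g_inv_d / (d.conj() @ g_inv_d)      # distortionless in look_dir
    return weights

def speaker_spatial_embedding(xvector, svector):
    """Toy fusion: concatenate the spectral (x-vector) and spatial (s-vector)
    embeddings and length-normalise; the paper learns this fusion with DMSNet."""
    emb = np.concatenate([xvector, svector])
    return emb / (np.linalg.norm(emb) + 1e-8)

# Example: a 4-microphone, 5 cm square array steered toward a DOA estimate.
# Applying weights[f].conj() @ X[:, f, t] to the multi-channel STFT X yields the
# beamformed spectrum from which an s-vector could then be extracted.
mics = 0.05 * np.array([[0, 0, 0], [1, 0, 0], [1, 1, 0], [0, 1, 0]], dtype=float)
w = superdirective_weights(mics, np.array([1.0, 0.0, 0.0]), np.linspace(100.0, 8000.0, 80))
```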
References
Showing 1-10 of 21 references
Multimodal Speaker Diarization of Real-World Meetings Using D-Vectors With Spatial Features
- ICASSP, 2020
Proposes a multimodal approach to speaker diarization that combines d-vectors with spatial information derived from beamforming on a multi-channel microphone array, evaluated on the AMI Meeting Corpus and an internal dataset of real-world conversations.
M2MeT: The ICASSP 2022 Multi-Channel Multi-Party Meeting Transcription Challenge
- ICASSP, 2022
Releases the AliMeeting corpus, which consists of 120 hours of recorded Mandarin meeting data, and launches the ICASSP 2022 Multi-channel Multi-party Meeting Transcription Challenge (M2MeT) with two tracks, speaker diarization and multi-speaker ASR, to provide a common testbed for meeting rich transcription and promote reproducible research in this field.
Investigation of Spatial-Acoustic Features for Overlapping Speech Detection in Multiparty Meetings
- Interspeech, 2021
In this paper, we propose an overlapping speech detection (OSD) system for real multiparty meetings. Different from previous works on single-channel recordings or simulated data, we conduct research…
Integration of Speech Separation, Diarization, and Recognition for Multi-Speaker Meetings: System Description, Comparison, and Analysis
- IEEE Spoken Language Technology Workshop (SLT), 2021
Experiments indicate that the problem of overlapping speech for diarization and ASR can be effectively mitigated with the presence of a well-trained separation module, and an end-to-end modular system for the LibriCSS meeting data is proposed.
Overlapped Speech Detection Based on Spectral and Spatial Feature Fusion
- Interspeech, 2021
Overlapped speech is widely present in conversations and can cause significant performance degradation on speech processing such as diarization, enhancement, and recognition. Detection of overlapped…
Acoustic Beamforming for Speaker Diarization of Meetings
- IEEE Transactions on Audio, Speech, and Language Processing, 2007
The use of classic acoustic beamforming techniques is proposed together with several novel algorithms to create a complete frontend for speaker diarization in the meeting room domain and shows improvements in a speech recognition task.
Deep Speaker Embedding Extraction with Channel-Wise Feature Responses and Additive Supervision Softmax Loss Function
- Interspeech, 2019
Adapts the recently proposed "squeeze-and-excitation" (SE) module from image classification by inserting SE blocks into deep residual networks (ResNet-SE), and proposes a new loss function, additive supervision softmax (AS-Softmax), to make full use of the prior knowledge of mis-classified samples at the training stage.
Front-end processing for the CHiME-5 dinner party scenario
- 5th International Workshop on Speech Processing in Everyday Environments (CHiME 2018), 2018
This contribution presents a speech enhancement system for the CHiME-5 Dinner Party Scenario. The front-end employs multi-channel linear time-variant filtering and achieves its gains without the use…
Detecting and Counting Overlapping Speakers in Distant Speech Scenarios
- Interspeech, 2020
A Temporal Convolutional Network (TCN) based method is designed to address the problem of detecting the activity and counting overlapping speakers in distant-microphone recordings, and it is shown that TCNs significantly outperform state-of-the-art methods on two real-world distant speech datasets.
Overlapped Speech Detection and Competing Speaker Counting – Humans Versus Deep Learning
- IEEE Journal of Selected Topics in Signal Processing, 2019
A perception study that evaluates participants' ability to accurately count multiple speakers in a single-channel audio file and analyzes the influence of listening time and of hearing familiar voices, significantly extending the findings in the existing literature.