Spatial-aware Speaker Diarization for Multi-channel Multi-party Meeting

Jie Wang, Yuji Liu, Binling Wang, Yiming Zhi, Song Li, Shipeng Xia, Jiayang Zhang, Feng Tong, Lin Li, Qingyang Hong
This paper describes a spatial-aware speaker diarization system for multi-channel multi-party meetings. The diarization system obtains speaker direction information from a microphone array. A speaker-spatial embedding is generated from the x-vector and an s-vector derived from superdirective beamforming (SDB), which makes the embedding more robust. Specifically, we propose a novel multi-channel sequence-to-sequence neural network architecture named discriminative multi-stream neural network (DMSNet) which…

Multimodal Speaker Diarization of Real-World Meetings Using D-Vectors With Spatial Features

A novel approach to multimodal speaker diarization that combines d-vectors with spatial information derived from performing beamforming given a multi-channel microphone array and is evaluated on the AMI Meeting Corpus and an internal dataset of real-world conversations.

M2MeT: The ICASSP 2022 Multi-Channel Multi-Party Meeting Transcription Challenge

  • Fan Yu, Shiliang Zhang, Hui Bu
  • Computer Science
  • ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2022
The AliMeeting corpus, which consists of 120 hours of recorded Mandarin meeting data, is made available and the ICASSP 2022 Multi-channel Multi-party Meeting Transcription Challenge (M2MeT) is launched with two tracks, namely speaker diarization and multi-speaker ASR, aiming to provide a common testbed for meeting rich transcription and promote reproducible research in this field.

Investigation of Spatial-Acoustic Features for Overlapping Speech Detection in Multiparty Meetings

In this paper, we propose an overlapping speech detection (OSD) system for real multiparty meetings. Different from previous works on single-channel recordings or simulated data, we conduct research

Integration of Speech Separation, Diarization, and Recognition for Multi-Speaker Meetings: System Description, Comparison, and Analysis

Experiments indicate that the problem of overlapping speech for diarization and ASR can be effectively mitigated with the presence of a well-trained separation module, and an end-to-end modular system for the LibriCSS meeting data is proposed.

Overlapped Speech Detection Based on Spectral and Spatial Feature Fusion

Overlapped speech is widely present in conversations and can cause significant performance degradation on speech processing such as diarization, enhancement, and recognition. Detection of overlapped

Acoustic Beamforming for Speaker Diarization of Meetings

The use of classic acoustic beamforming techniques is proposed together with several novel algorithms to create a complete frontend for speaker diarization in the meeting room domain and shows improvements in a speech recognition task.

Deep Speaker Embedding Extraction with Channel-Wise Feature Responses and Additive Supervision Softmax Loss Function

This paper adopts the “squeeze-and-excitation” (SE) module, recently proposed for image classification, by introducing SE blocks into deep residual networks (ResNet-SE), and proposes a new loss function, namely additive supervision softmax (AS-Softmax), to make full use of prior knowledge of the misclassified samples at the training stage.

Front-end processing for the CHiME-5 dinner party scenario

This contribution presents a speech enhancement system for the CHiME-5 Dinner Party Scenario. The front-end employs multi-channel linear time-variant filtering and achieves its gains without the use

Detecting and Counting Overlapping Speakers in Distant Speech Scenarios

A Temporal Convolutional Network (TCN) based method is designed to address the problem of detecting the activity and counting overlapping speakers in distant-microphone recordings, and it is shown that TCNs significantly outperform state-of-the-art methods on two real-world distant speech datasets.

Overlapped Speech Detection and Competing Speaker Counting – Humans Versus Deep Learning

A perception study that evaluates participants’ ability to accurately count multiple speakers in a single-channel audio file, and analyzes the influence of listening time and of hearing familiar voices, significantly extends the findings in existing literature.