Corpus ID: 246823693

The xmuspeech system for multi-channel multi-party meeting transcription challenge

Jie Wang, Yuji Liu, Binling Wang, Yiming Zhi, Song Li, Shipeng Xia, Jiayang Zhang, Lin Li, Qingyang Hong, Feng Tong
This paper describes the system developed by the XMUSPEECH team for the Multi-channel Multi-party Meeting Transcription Challenge (M2MeT). For the speaker diarization task, we propose a multi-channel speaker diarization system that obtains spatial information about speakers via Direction of Arrival (DOA) estimation. A speaker-spatial embedding is generated from the x-vector together with an s-vector derived from Filter-and-Sum Beamforming (FSB), which makes the embedding more robust. Specifically, we propose a novel… 
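
The abstract does not specify how the x-vector and s-vector are fused into a speaker-spatial embedding; a minimal sketch, assuming simple L2-normalisation and concatenation (the function name and embedding dimensions are illustrative, not from the paper):

```python
import numpy as np

def speaker_spatial_embedding(x_vector, s_vector):
    """Fuse a speaker embedding (x-vector) with a spatial embedding
    (s-vector, e.g. derived from beamformed/DOA features) by
    L2-normalising each part and concatenating them."""
    x = x_vector / np.linalg.norm(x_vector)
    s = s_vector / np.linalg.norm(s_vector)
    return np.concatenate([x, s])

# Toy example: 512-dim x-vector plus 128-dim s-vector -> 640-dim embedding.
rng = np.random.default_rng(0)
emb = speaker_spatial_embedding(rng.standard_normal(512),
                                rng.standard_normal(128))
print(emb.shape)  # (640,)
```

Normalising each part before concatenation keeps either modality from dominating the distance computations used during clustering.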

M2Met: The Icassp 2022 Multi-Channel Multi-Party Meeting Transcription Challenge

  • Fan Yu, Shiliang Zhang, Hui Bu
  • Computer Science
    ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2022
The AliMeeting corpus, which consists of 120 hours of recorded Mandarin meeting data, is made available and the ICASSP 2022 Multi-channel Multi-party Meeting Transcription Challenge (M2MeT) is launched with two tracks, namely speaker diarization and multi-speaker ASR, aiming to provide a common testbed for meeting rich transcription and promote reproducible research in this field.

Multimodal Speaker Diarization of Real-World Meetings Using D-Vectors With Spatial Features

A novel approach to multimodal speaker diarization that combines d-vectors with spatial information derived from beamforming over a multi-channel microphone array, evaluated on the AMI Meeting Corpus and an internal dataset of real-world conversations.

DOVER-Lap: A Method for Combining Overlap-Aware Diarization Outputs

The method, DOVER-Lap, is inspired by the recently proposed DOVER algorithm, but is designed to handle overlapping segments in diarization outputs and modifies the pair-wise incremental label mapping strategy used in DOVER.
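
The core idea of combining multiple diarization outputs can be illustrated with a greatly simplified frame-level majority vote. This sketch assumes the hypotheses are already mapped to a common label space; actual DOVER/DOVER-Lap additionally solve that label-mapping problem and use rank-weighted voting with overlap handling:

```python
from collections import Counter

def majority_vote(hypotheses):
    """Combine diarization outputs by per-frame majority vote.

    Each hypothesis is a list of per-frame speaker labels, all of the
    same length and already mapped to a shared label space.  The output
    keeps, for each frame, the label most hypotheses agree on.
    """
    combined = []
    for frame_labels in zip(*hypotheses):
        label, _count = Counter(frame_labels).most_common(1)[0]
        combined.append(label)
    return combined

# Three toy single-speaker-per-frame hypotheses over four frames.
h1 = ["A", "A", "B", "B"]
h2 = ["A", "B", "B", "B"]
h3 = ["A", "A", "B", "A"]
print(majority_vote([h1, h2, h3]))  # ['A', 'A', 'B', 'B']
```

The overlap-aware extension in DOVER-Lap allows more than one label per frame, which this single-label sketch deliberately omits.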

AISHELL-4: An Open Source Dataset for Speech Enhancement, Separation, Recognition and Speaker Diarization in Conference Scenario

AISHELL-4, a sizable real-recorded Mandarin speech dataset collected with an 8-channel circular microphone array for speech processing in the conference scenario, is presented; it is the only Mandarin dataset for conversational speech, providing additional value for data diversity in the speech community.

Deep Speaker Embedding Extraction with Channel-Wise Feature Responses and Additive Supervision Softmax Loss Function

This paper adapts the recently proposed “squeeze-and-excitation” (SE) module from image classification by inserting SE blocks into deep residual networks (ResNet-SE), and proposes a new loss function, additive supervision softmax (AS-Softmax), to make full use of the prior knowledge of mis-classified samples during training.

Pyannote.Audio: Neural Building Blocks for Speaker Diarization

This work introduces pyannote.audio, an open-source toolkit written in Python for speaker diarization, which provides a set of trainable end-to-end neural building blocks that can be combined and jointly optimized to build speaker diarization pipelines.

CN-Celeb: A Challenging Chinese Speaker Recognition Dataset

  • Yue Fan, Jiawen Kang, Dong Wang
  • Computer Science
    ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2020
CN-Celeb is presented, a large-scale speaker recognition dataset collected ‘in the wild’ that contains more than 130,000 utterances from 1,000 Chinese celebrities and covers 11 different real-world genres.

Conformer: Convolution-augmented Transformer for Speech Recognition

This work proposes the convolution-augmented transformer for speech recognition, named Conformer, which significantly outperforms previous Transformer- and CNN-based models, achieving state-of-the-art accuracies.

A study on data augmentation of reverberant speech for robust speech recognition

It is found that the performance gap between using simulated and real RIRs can be eliminated when point-source noises are added, and that the trained acoustic models not only perform well in the distant-talking scenario but also provide better results in the close-talking scenario.
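
The augmentation the paper studies amounts to convolving clean speech with a room impulse response and adding a point-source noise at a target SNR. A minimal sketch (function name, toy RIR, and default SNR are illustrative assumptions):

```python
import numpy as np

def reverberate(speech, rir, noise=None, snr_db=15.0):
    """Simulate far-field speech: convolve clean speech with a room
    impulse response (RIR), then optionally add a point-source noise
    scaled to a target signal-to-noise ratio."""
    rev = np.convolve(speech, rir)[: len(speech)]
    if noise is not None:
        noise = noise[: len(rev)]
        sig_pow = np.mean(rev ** 2)
        noise_pow = np.mean(noise ** 2) + 1e-12
        # Scale noise so that 10*log10(sig_pow / noise_pow') == snr_db.
        scale = np.sqrt(sig_pow / (noise_pow * 10 ** (snr_db / 10)))
        rev = rev + scale * noise
    return rev

rng = np.random.default_rng(1)
speech = rng.standard_normal(16000)        # 1 s of stand-in "speech" at 16 kHz
rir = np.exp(-np.arange(800) / 100.0)      # toy exponentially decaying RIR
noisy = reverberate(speech, rir, noise=rng.standard_normal(16000))
print(noisy.shape)  # (16000,)
```

In practice the RIRs would be simulated (e.g. by the image method) or measured, rather than the toy exponential decay used here.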

Squeeze-and-Excitation Networks

This work proposes a novel architectural unit, termed the “Squeeze-and-Excitation” (SE) block, that adaptively recalibrates channel-wise feature responses by explicitly modelling interdependencies between channels, and shows that these blocks can be stacked together to form SENet architectures that generalise extremely effectively across different datasets.
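
The squeeze-excite-rescale computation described above is compact enough to sketch directly. A minimal NumPy version (weight shapes follow the paper's bottleneck with reduction ratio r; the function name and toy dimensions are illustrative):

```python
import numpy as np

def se_block(feature_map, w1, w2):
    """Squeeze-and-Excitation on a (C, H, W) feature map:
    squeeze  -> global average pool to a C-dim channel descriptor,
    excite   -> two-layer bottleneck (ReLU, then sigmoid gates),
    rescale  -> multiply each channel by its gate in (0, 1)."""
    z = feature_map.mean(axis=(1, 2))            # squeeze: (C,)
    s = np.maximum(w1 @ z, 0.0)                  # bottleneck + ReLU: (C//r,)
    gates = 1.0 / (1.0 + np.exp(-(w2 @ s)))      # sigmoid gates: (C,)
    return feature_map * gates[:, None, None]    # channel-wise rescale

C, r = 8, 4
rng = np.random.default_rng(2)
x = rng.standard_normal((C, 6, 6))
out = se_block(x,
               rng.standard_normal((C // r, C)),   # w1: reduce C -> C/r
               rng.standard_normal((C, C // r)))   # w2: expand C/r -> C
print(out.shape)  # (8, 6, 6)
```

Because the gates lie in (0, 1), the block can only attenuate channels, which is what "adaptively recalibrates channel-wise feature responses" means operationally.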