Personal VAD: Speaker-Conditioned Voice Activity Detection
@inproceedings{Ding2019PersonalVS,
  title     = {Personal VAD: Speaker-Conditioned Voice Activity Detection},
  author    = {Shaojin Ding and Quan Wang and Shuo-yiin Chang and Li Wan and Ignacio Lopez-Moreno},
  booktitle = {The Speaker and Language Recognition Workshop},
  year      = {2019}
}
In this paper, we propose "personal VAD", a system to detect the voice activity of a target speaker at the frame level. This system is useful for gating the inputs to a streaming on-device speech recognition system so that it triggers only for the target user, which helps reduce computational cost and battery consumption, especially in scenarios where a keyword detector is not preferred. We achieve this by training a VAD-like neural network that is conditioned on the target speaker…
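As a rough illustration of the conditioning idea described in the abstract (not the authors' implementation), the sketch below concatenates a fixed target-speaker embedding (d-vector) to every acoustic frame and scores three classes per frame: non-speech (ns), target-speaker speech (tss), and non-target-speaker speech (ntss). The layer sizes, random weights, and function names are placeholders for a trained network.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def personal_vad_frame_scores(frames, dvector, seed=0):
    """Score each frame as ns / tss / ntss, conditioned on a speaker embedding.

    frames:  (T, F) acoustic features, e.g. log-mel filterbanks
    dvector: (D,)   fixed embedding of the enrolled target speaker
    Returns (T, 3) per-frame class probabilities.
    """
    rng = np.random.default_rng(seed)
    T, F = frames.shape
    D = dvector.shape[0]
    # Condition the VAD on the target speaker by concatenating the
    # d-vector to every frame (one simple conditioning variant).
    x = np.concatenate([frames, np.tile(dvector, (T, 1))], axis=1)  # (T, F+D)
    # Tiny stand-in for the trained network: one hidden layer + softmax.
    W1 = rng.standard_normal((F + D, 64)) * 0.1
    W2 = rng.standard_normal((64, 3)) * 0.1
    h = np.tanh(x @ W1)
    return softmax(h @ W2)  # columns: [ns, tss, ntss]

# Toy inputs: 10 frames of 40-dim features, 256-dim d-vector.
probs = personal_vad_frame_scores(np.zeros((10, 40)), np.zeros(256))
```

A deployed system would threshold the tss probability per frame to gate the recognizer; here the weights are random, so only shapes and normalization are meaningful.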
39 Citations
Speaker Activity Driven Neural Speech Extraction
- Computer Science · ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2021
It is shown that this simple yet practical approach can successfully extract speakers after diarization, which results in improved ASR performance, especially in high overlapping conditions, with a relative word error rate reduction of up to 25%.
Target-Speaker Voice Activity Detection: a Novel Approach for Multi-Speaker Diarization in a Dinner Party Scenario
- Computer Science · INTERSPEECH
- 2020
A novel Target-Speaker Voice Activity Detection (TS-VAD) approach that directly predicts the activity of each speaker in each time frame, outperforming the baseline x-vector-based system by more than 30% absolute Diarization Error Rate (DER).
Enrollment-less training for personalized voice activity detection
- Computer Science · INTERSPEECH
- 2021
A novel personalized voice activity detection (PVAD) learning method, called enrollment-less training, that enables PVAD training without requiring enrollment data.
Target-Speaker Voice Activity Detection with Improved i-Vector Estimation for Unknown Number of Speakers
- Computer Science · INTERSPEECH
- 2021
This paper extends TS-VAD to speaker diarization with unknown numbers of speakers, and proposes a fusion-based method to combine frame-level decisions from the systems for an improved initialization.
Personal VAD 2.0: Optimizing Personal Voice Activity Detection for On-Device Speech Recognition
- Computer Science · INTERSPEECH
- 2022
This work presents Personal VAD 2.0, a personalized voice activity detector that detects the voice activity of a target speaker, as part of a streaming on-device ASR system.
Multi-User Voicefilter-Lite via Attentive Speaker Embedding
- Computer Science · 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)
- 2021
The experiments show that, with up to four enrolled users, multi-user VoiceFilter-Lite is able to significantly reduce speech recognition and speaker verification errors when there is overlapping speech, without affecting performance under other acoustic conditions.
VoiceFilter-Lite: Streaming Targeted Voice Separation for On-Device Speech Recognition
- Computer Science · INTERSPEECH
- 2020
This work introduces VoiceFilter-Lite, a single-channel source separation model that runs on the device to preserve only the speech signals from a target user, as part of a streaming speech recognition system, and shows that such a model can be quantized as an 8-bit integer model and run in real time.
Voice Activity Detection in the Wild: A Data-Driven Approach Using Teacher-Student Training
- Computer Science · IEEE/ACM Transactions on Audio, Speech, and Language Processing
- 2021
This work proposes a data-driven teacher-student approach for VAD that leverages vast, unconstrained audio data for training, enabling the use of any real-world, potentially noisy dataset.
Polynomial Eigenvalue Decomposition-Based Target Speaker Voice Activity Detection in the Presence of Competing Talkers
- Computer Science · 2022 International Workshop on Acoustic Signal Enhancement (IWAENC)
- 2022
A polynomial eigenvalue decomposition-based target-speaker VAD algorithm is proposed to detect unseen target speakers in the presence of competing talkers; it is consistently among the best in F1 and balanced accuracy scores over the investigated range of signal-to-interference ratios (SIR).
Sparsely Overlapped Speech Training in the Time Domain: Joint Learning of Target Speech Separation and Personal VAD Benefits
- Computer Science · 2021 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)
- 2021
The weighted SI-SNR loss is proposed, together with joint learning of target speech separation and personal VAD; it imposes a weight factor proportional to the target speaker's duration and returns zero when the target speaker is absent.
36 References
Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis
- Computer Science · NeurIPS
- 2018
It is shown that randomly sampled speaker embeddings can be used to synthesize speech in the voice of novel speakers dissimilar from those used in training, indicating that the model has learned a high quality speaker representation.
Temporal Modeling Using Dilated Convolution and Gating for Voice-Activity-Detection
- Computer Science · 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2018
This paper proposes an alternative architecture that does not suffer from saturation problems by modeling temporal variations through a stateless dilated convolution neural network (CNN), which differs from conventional CNNs in three respects: it uses dilated causal convolution, gated activations and residual connections.
VoiceFilter: Targeted Voice Separation by Speaker-Conditioned Spectrogram Masking
- Computer Science · INTERSPEECH
- 2019
A novel system that separates the voice of a target speaker from multi-speaker signals by making use of a reference signal from the target speaker, trained as two separate neural networks.
Sample Efficient Adaptive Text-to-Speech
- Computer Science · ICLR
- 2019
Three strategies for adapting a multi-speaker neural network to new speakers are introduced and benchmarked, obtaining state-of-the-art results in both sample naturalness and voice similarity with merely a few minutes of audio data from new speakers.
Deep Speaker: an End-to-End Neural Speaker Embedding System
- Computer Science, Physics · arXiv
- 2017
Results are presented suggesting that adapting from a model trained on Mandarin can improve accuracy for English speaker recognition, and that Deep Speaker outperforms a DNN-based i-vector baseline.
All for one: feature combination for highly channel-degraded speech activity detection
- Computer Science · INTERSPEECH
- 2013
This paper presents a feature combination approach to improve SAD on highly channel degraded speech as part of the Defense Advanced Research Projects Agency’s (DARPA) Robust Automatic Transcription of Speech (RATS) program and presents single, pairwise and all feature combinations.
Streaming End-to-end Speech Recognition for Mobile Devices
- Computer Science · ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2019
This work describes its efforts at building an E2E speech recognizer using a recurrent neural network transducer and finds that the proposed approach can outperform a conventional CTC-based model in terms of both latency and accuracy.
Direct modeling of raw audio with DNNs for wake word detection
- Computer Science · 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)
- 2017
This work develops a technique for training features directly from the single-channel speech waveform in order to improve wake word (WW) detection performance, and shows the effectiveness of this stage-wise training technique through a set of experiments on real beam-formed far-field data.
Speaker Diarization with LSTM
- Computer Science · 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2018
This work combines LSTM-based d-vector audio embeddings with recent work in nonparametric clustering to obtain a state-of-the-art speaker diarization system that achieves a 12.0% diarization error rate on NIST SRE 2000 CALLHOME, while the model is trained with out-of-domain data from voice search logs.
Voice Activity Detection: Merging Source and Filter-based Information
- Computer Science · IEEE Signal Processing Letters
- 2016
A mutual information-based assessment shows superior discrimination power for the source-related features, especially the proposed ones, and two strategies are proposed to merge source and filter information: feature and decision fusion.