Personal VAD: Speaker-Conditioned Voice Activity Detection

@inproceedings{Ding2019PersonalVS,
  title={Personal VAD: Speaker-Conditioned Voice Activity Detection},
  author={Shaojin Ding and Quan Wang and Shuo-yiin Chang and Li Wan and Ignacio Lopez-Moreno},
  booktitle={The Speaker and Language Recognition Workshop},
  year={2019}
}

In this paper, we propose "personal VAD", a system to detect the voice activity of a target speaker at the frame level. This system is useful for gating the inputs to a streaming on-device speech recognition system so that it only triggers for the target user, which helps reduce computational cost and battery consumption, especially in scenarios where a keyword detector is undesirable. We achieve this by training a VAD-like neural network that is conditioned on the target speaker…
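
The core idea is compact enough to sketch. Below is a minimal PyTorch rendering, assuming the paper's d-vector conditioning and its three frame-level classes (non-speech, non-target speech, target speech); the single-LSTM body, layer sizes, and names are illustrative stand-ins, not the paper's exact architecture.

import torch
import torch.nn as nn

class PersonalVAD(nn.Module):
    """Frame-level VAD conditioned on a target-speaker embedding (sketch)."""
    def __init__(self, feat_dim=40, emb_dim=256, hidden=64, num_classes=3):
        super().__init__()
        # Every acoustic frame is concatenated with the target d-vector.
        self.lstm = nn.LSTM(feat_dim + emb_dim, hidden, batch_first=True)
        # Three classes per frame: non-speech, non-target speech, target speech.
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, frames, dvector):
        # frames: (batch, time, feat_dim); dvector: (batch, emb_dim)
        cond = dvector.unsqueeze(1).expand(-1, frames.size(1), -1)
        out, _ = self.lstm(torch.cat([frames, cond], dim=-1))
        return self.head(out)  # per-frame class logits

model = PersonalVAD()
logits = model(torch.randn(2, 100, 40), torch.randn(2, 256))  # (2, 100, 3)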

Related papers

Speaker Activity Driven Neural Speech Extraction

It is shown that this simple yet practical approach can successfully extract speakers after diarization, improving ASR performance, especially in highly overlapping conditions, with a relative word error rate reduction of up to 25%.

Target-Speaker Voice Activity Detection: a Novel Approach for Multi-Speaker Diarization in a Dinner Party Scenario

A novel Target-Speaker Voice Activity Detection (TS-VAD) approach, which directly predicts the activity of each speaker on each time frame, outperforming the baseline x-vector-based system by more than 30% absolute in Diarization Error Rate (DER).
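
The decision layer lends itself to a short sketch: each frame encoding is paired with every enrolled speaker's i-vector and scored with a per-speaker sigmoid. This is a simplified, hypothetical rendering; the published model also runs a joint recurrent layer across all speakers before the output, which is omitted here.

import torch
import torch.nn as nn

class TSVADHead(nn.Module):
    """Per-frame, per-speaker activity from frame encodings and i-vectors."""
    def __init__(self, frame_dim=128, ivec_dim=100, hidden=128):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(frame_dim + ivec_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, frames, ivectors):
        # frames: (batch, time, frame_dim); ivectors: (batch, n_spk, ivec_dim)
        B, T, _ = frames.shape
        N = ivectors.size(1)
        f = frames.unsqueeze(2).expand(B, T, N, -1)
        s = ivectors.unsqueeze(1).expand(B, T, N, -1)
        logits = self.score(torch.cat([f, s], dim=-1)).squeeze(-1)
        return torch.sigmoid(logits)  # (batch, time, n_spk) activity probs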

Enrollment-less training for personalized voice activity detection

A novel personalized voice activity detection (PVAD) learning method, called enrollment-less training, that does not require enrollment speech during training.

Target-Speaker Voice Activity Detection with Improved i-Vector Estimation for Unknown Number of Speakers

This paper extends TS-VAD to speaker diarization with unknown numbers of speakers, and proposes a fusion-based method to combine frame-level decisions from the systems for an improved initialization.

Personal VAD 2.0: Optimizing Personal Voice Activity Detection for On-Device Speech Recognition

This work presents Personal VAD 2.0, a personalized voice activity detector that detects the voice activity of a target speaker, as part of a streaming on-device ASR system.

Multi-User Voicefilter-Lite via Attentive Speaker Embedding

The experiments show that, with up to four enrolled users, multi-user VoiceFilter-Lite is able to significantly reduce speech recognition and speaker verification errors when there is overlapping speech, without affecting performance under other acoustic conditions.
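
One way to read "attentive speaker embedding" is attention pooling over the enrolled d-vectors, so the model sees a single per-frame embedding no matter how many users are enrolled. The sketch below assumes dot-product scoring and a learned projection; both are illustrative guesses, not the paper's exact formulation.

import torch
import torch.nn as nn
import torch.nn.functional as F

def attentive_embedding(frame_feats, enrolled, proj):
    # frame_feats: (batch, time, frame_dim); enrolled: (batch, n_users, emb_dim)
    # Score each enrolled d-vector against each frame, then return the
    # softmax-weighted mix as a single conditioning vector per frame.
    scores = torch.einsum('bte,bne->btn', proj(frame_feats), enrolled)
    weights = F.softmax(scores, dim=-1)          # (batch, time, n_users)
    return torch.einsum('btn,bne->bte', weights, enrolled)

proj = nn.Linear(40, 256)  # maps frame features into the d-vector space
mixed = attentive_embedding(torch.randn(2, 100, 40),
                            torch.randn(2, 4, 256), proj)  # (2, 100, 256)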

VoiceFilter-Lite: Streaming Targeted Voice Separation for On-Device Speech Recognition

This work introduces VoiceFilter-Lite, a single-channel source separation model that runs on the device to preserve only the speech signals from a target user, as part of a streaming speech recognition system, and shows that such a model can be quantized as an 8-bit integer model and run in real time.

Voice Activity Detection in the Wild: A Data-Driven Approach Using Teacher-Student Training

This work proposes a data-driven teacher-student approach for VAD, which trains on vast and unconstrained audio data, enabling the use of any real-world, potentially noisy dataset.
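
The training loop reduces to distillation: a pretrained teacher labels unconstrained audio and the student fits those frame posteriors. A minimal sketch, assuming both networks emit frame-level logits; binary cross-entropy on soft targets is one common choice, not necessarily the paper's exact loss.

import torch
import torch.nn.functional as F

def distill_step(student, teacher, unlabeled_audio, optimizer):
    """One teacher-student update on unlabeled, in-the-wild audio (sketch)."""
    with torch.no_grad():
        soft_targets = torch.sigmoid(teacher(unlabeled_audio))  # teacher posteriors
    loss = F.binary_cross_entropy_with_logits(student(unlabeled_audio),
                                              soft_targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()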

Polynomial Eigenvalue Decomposition-Based Target Speaker Voice Activity Detection in the Presence of Competing Talkers

A polynomial eigenvalue decomposition-based target-speaker VAD algorithm is proposed to detect unseen target speakers in the presence of competing talkers; it is consistently among the best in F1 and balanced accuracy scores over the investigated range of signal-to-interference ratios (SIR).

Sparsely Overlapped Speech Training in the Time Domain: Joint Learning of Target Speech Separation and Personal VAD Benefits

A weighted SI-SNR loss is proposed, together with joint learning of target speech separation and personal VAD; the loss imposes a weight factor proportional to the target speaker's duration and returns zero when the target speaker is absent.
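
The loss itself is easy to write down. A sketch following the summary's description: each utterance's SI-SNR term is scaled by the fraction of frames in which the target is active, so mixtures without the target contribute zero; the SI-SNR computation is the standard scale-invariant definition.

import torch

def si_snr(est, ref, eps=1e-8):
    """Scale-invariant SNR in dB; est, ref: (batch, samples)."""
    est = est - est.mean(-1, keepdim=True)
    ref = ref - ref.mean(-1, keepdim=True)
    proj = ((est * ref).sum(-1, keepdim=True)
            / ((ref * ref).sum(-1, keepdim=True) + eps)) * ref
    noise = est - proj
    return 10 * torch.log10(
        (proj * proj).sum(-1) / ((noise * noise).sum(-1) + eps) + eps)

def weighted_si_snr_loss(est, ref, active_frames, total_frames):
    # Weight is proportional to the target speaker's duration and is
    # exactly zero when the target is absent from the mixture.
    w = active_frames.float() / total_frames
    return -(w * si_snr(est, ref)).mean()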

Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis

It is shown that randomly sampled speaker embeddings can be used to synthesize speech in the voice of novel speakers dissimilar from those used in training, indicating that the model has learned a high-quality speaker representation.

Temporal Modeling Using Dilated Convolution and Gating for Voice-Activity-Detection

This paper proposes an alternative architecture that does not suffer from saturation problems, modeling temporal variations through a stateless dilated convolutional neural network (CNN) that differs from conventional CNNs in three respects: it uses dilated causal convolutions, gated activations, and residual connections.
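
Those three ingredients form the familiar WaveNet-style block. A sketch of one such block, with channel counts and kernel size chosen arbitrarily for illustration:

import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedDilatedBlock(nn.Module):
    """Causal dilated convolution with gated activation and a residual path."""
    def __init__(self, channels, dilation, kernel=3):
        super().__init__()
        self.pad = (kernel - 1) * dilation  # left-pad only, so the conv is causal
        self.filt = nn.Conv1d(channels, channels, kernel, dilation=dilation)
        self.gate = nn.Conv1d(channels, channels, kernel, dilation=dilation)

    def forward(self, x):                      # x: (batch, channels, time)
        h = F.pad(x, (self.pad, 0))            # no look-ahead into future frames
        out = torch.tanh(self.filt(h)) * torch.sigmoid(self.gate(h))
        return x + out                         # residual connection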

VoiceFilter: Targeted Voice Separation by Speaker-Conditioned Spectrogram Masking

A novel system that separates the voice of a target speaker from multi-speaker signals by making use of a reference signal from the target speaker; the system is built by training two separate neural networks.
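
The masking step follows the same speaker-conditioning pattern as personal VAD, but the output is a soft mask over frequency bins rather than class labels. A sketch, with the paper's CNN+LSTM mask network replaced by a single LSTM for brevity:

import torch
import torch.nn as nn

class SpectrogramMasker(nn.Module):
    """Speaker-conditioned soft mask over magnitude spectrograms (sketch)."""
    def __init__(self, n_bins=257, emb_dim=256, hidden=256):
        super().__init__()
        self.lstm = nn.LSTM(n_bins + emb_dim, hidden, batch_first=True)
        self.mask = nn.Linear(hidden, n_bins)

    def forward(self, mag, dvector):
        # mag: (batch, time, n_bins); dvector: (batch, emb_dim)
        cond = dvector.unsqueeze(1).expand(-1, mag.size(1), -1)
        h, _ = self.lstm(torch.cat([mag, cond], dim=-1))
        return mag * torch.sigmoid(self.mask(h))  # keep only the target's energy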

Sample Efficient Adaptive Text-to-Speech

Three strategies for adapting the multi-speaker neural network to new speakers are introduced and benchmarked, obtaining state-of-the-art results in both sample naturalness and voice similarity with merely a few minutes of audio data from new speakers.

Deep Speaker: an End-to-End Neural Speaker Embedding System

Results suggest that adapting from a model trained on Mandarin can improve accuracy for English speaker recognition, and that Deep Speaker outperforms a DNN-based i-vector baseline.

All for one: feature combination for highly channel-degraded speech activity detection

This paper presents a feature combination approach to improve speech activity detection (SAD) on highly channel-degraded speech as part of the Defense Advanced Research Projects Agency's (DARPA) Robust Automatic Transcription of Speech (RATS) program, evaluating single, pairwise, and all-feature combinations.

Streaming End-to-end Speech Recognition for Mobile Devices

This work describes efforts at building an E2E speech recognizer using a recurrent neural network transducer and finds that the proposed approach can outperform a conventional CTC-based model in terms of both latency and accuracy.

Direct modeling of raw audio with DNNS for wake word detection

This work develops a technique for training features directly from the single-channel speech waveform in order to improve wake word (WW) detection performance, and shows the effectiveness of this stage-wise training technique through a set of experiments on real beam-formed far-field data.

Speaker Diarization with LSTM

This work combines LSTM-based d-vector audio embeddings with recent work in nonparametric clustering to obtain a state-of-the-art speaker diarization system that achieves a 12.0% diarization error rate on NIST SRE 2000 CALLHOME, while the model is trained with out-of-domain data from voice search logs.
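
The pipeline is embed-then-cluster: slide a window over the recording, extract a d-vector per window, and cluster the vectors into speakers. The sketch below substitutes plain agglomerative clustering (the paper's best results use spectral clustering with refinement steps) and assumes the speaker count is known.

import numpy as np
from sklearn.cluster import AgglomerativeClustering

def diarize(window_dvectors, n_speakers):
    # window_dvectors: (n_windows, emb_dim) of L2-normalized d-vectors.
    # With normalized vectors, Euclidean distance ranks pairs the same
    # way cosine distance does, so the default metric is reasonable.
    clusterer = AgglomerativeClustering(n_clusters=n_speakers)
    return clusterer.fit_predict(window_dvectors)  # one label per window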

Voice Activity Detection: Merging Source and Filter-based Information

A mutual information-based assessment shows superior discrimination power for the source-related features, especially the proposed ones, and two strategies are proposed to merge source and filter information: feature and decision fusion.
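
The two fusion strategies differ only in where the merge happens. A hypothetical sketch: feature fusion concatenates the per-frame feature views before a single classifier, while decision fusion combines the posteriors of two already-trained classifiers.

import numpy as np

def feature_fusion(source_feats, filter_feats):
    # Early fusion: one classifier sees both feature views per frame.
    return np.concatenate([source_feats, filter_feats], axis=-1)

def decision_fusion(p_source, p_filter, alpha=0.5):
    # Late fusion: average the two classifiers' frame-level posteriors.
    return alpha * p_source + (1 - alpha) * p_filter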