Text-Dependent Audiovisual Synchrony Detection for Spoofing Detection in Mobile Person Recognition
@inproceedings{Aides2016TextDependentAS,
  title={Text-Dependent Audiovisual Synchrony Detection for Spoofing Detection in Mobile Person Recognition},
  author={Amit Aides and Hagai Aronowitz},
  booktitle={Interspeech},
  year={2016}
}
Liveness detection is an important countermeasure against spoofing attacks on biometric authentication systems. In the context of audiovisual biometrics, synchrony detection is a proposed method for liveness confirmation. This paper presents a novel, text-dependent scheme for checking audiovisual synchronization in a video sequence. We present custom visual features learned using a unique deep learning framework and show that they outperform other commonly used visual features. We tested our…
8 Citations
Robust Audiovisual Liveness Detection for Biometric Authentication Using Deep Joint Embedding and Dynamic Time Warping
2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
This work proposes to measure liveness by comparing alignments of the audio and video streams to the a priori recorded sequence using dynamic time warping, providing improved performance compared to competing methods.
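The dynamic time warping step mentioned in this citation can be illustrated with a minimal sketch; the 1-D features and absolute-difference cost below are illustrative assumptions, not the paper's actual audiovisual features:

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic time warping distance between two 1-D feature sequences.

    D[i, j] holds the cheapest cumulative alignment cost of a[:i] vs b[:j];
    each step may advance one sequence, the other, or both.
    """
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# Toy feature contours: a "replay" that matches the live contour but is
# slightly time-stretched aligns cheaply; an unrelated contour does not.
live = np.array([0.0, 1.0, 2.0, 3.0])
replay = np.array([0.0, 1.0, 1.0, 2.0, 3.0])
noise = np.array([3.0, 0.0, 3.0, 0.0])
print(dtw_distance(live, replay))  # → 0.0 (perfect warped match)
print(dtw_distance(live, noise))   # larger: contours disagree
```

A liveness decision along these lines would threshold the alignment cost of the observed audio/video against the pre-recorded reference sequence.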
Spoofing detection via simultaneous verification of audio-visual synchronicity and transcription
2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)
This work uses coupled hidden Markov models (CHMMs) for text-dependent spoofing detection and introduces new features that capture both the transcription of the utterance and the synchronicity of the two streams, leading to more robust recognition.
Deep Siamese Architecture Based Replay Detection for Secure Voice Biometric
INTERSPEECH, 2018
A novel approach detects replayed speech by evaluating the similarity between pairs of speech samples with an embedding learned by a deep Siamese architecture; it outperforms state-of-the-art systems on the ASVspoof 2017 challenge corpus without relying on fusion with other sub-systems.
Audiovisual Synchrony Detection with Optimized Audio Features
2018 IEEE 3rd International Conference on Signal and Image Processing (ICSIP)
Deep CCA (DCCA), a nonlinear extension of CCA, is adopted to enhance joint-space modeling, yielding substantially improved audiovisual speech synchrony detection with an equal error rate (EER) of 3.68%.
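The equal error rate (EER) quoted here is the standard operating point at which the false-accept and false-reject rates coincide. A minimal sketch of computing it from raw detection scores (the score values below are made up for illustration):

```python
import numpy as np

def equal_error_rate(genuine_scores, impostor_scores):
    """Scan candidate thresholds and return the rate at the point where
    the false-accept rate (FAR) and false-reject rate (FRR) are closest."""
    thresholds = np.sort(np.concatenate([genuine_scores, impostor_scores]))
    best_gap, eer = 1.0, 1.0
    for t in thresholds:
        far = np.mean(impostor_scores >= t)  # impostors wrongly accepted
        frr = np.mean(genuine_scores < t)    # genuine trials wrongly rejected
        if abs(far - frr) < best_gap:
            best_gap = abs(far - frr)
            eer = (far + frr) / 2.0
    return eer

genuine = np.array([0.9, 0.8, 0.75, 0.6])    # synchronous (target) trials
impostor = np.array([0.4, 0.3, 0.65, 0.2])   # asynchronous (spoof) trials
print(equal_error_rate(genuine, impostor))   # → 0.25
```

With one impostor score above one genuine score, the FAR/FRR curves cross at 25%, so the EER is 0.25; a 3.68% EER as reported above corresponds to a much cleaner score separation.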
Speaker recognition using common passphrases in RedDots
2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
This paper reports work on the recently collected text-dependent speaker recognition dataset named RedDots, with a focus on the common-passphrase condition; it also reports the use of bagging for improved accuracy and an analysis of system sensitivity to the duration between enrollment and testing (template aging).
Using audio-visual information to understand speaker activity: Tracking active speakers on and off screen
2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
An extremely simple approach to generating (weak) speech clusters can be combined with strong visual signals to effectively associate faces and voices, aggregating statistics across a video by fusing information from the audio and visual streams.
Audio and visual modality combination in speech processing applications
The Handbook of Multimodal-Multisensor Interfaces, Volume 1, 2017
This chapter focuses on AVASR while also addressing related problems, namely audio-visual speech activity detection, diarization, and synchrony detection, and covers rapid recent advances leading to so-called "end-to-end" AVASR systems.
Multimodal Transformer Distillation for Audio-Visual Synchronization
ArXiv, 2022
An MTDVocaLiST model is proposed, trained with the proposed multimodal Transformer distillation (MTD) loss to deeply mimic the cross-attention distribution and value relations in the Transformer of VocaLiST.
29 References
A speaker independent "liveness" test for audio-visual biometrics
INTERSPEECH, 2005
This test ensures that biometric cues being acquired are actual measurements from a live person who is present at the time of capture, and uses the correlation that exists between the lip movements and the speech produced.
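The lip/speech correlation idea behind this liveness test can be sketched with synthetic per-frame tracks; audio energy and mouth-opening height are assumed stand-ins for the paper's actual features:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-frame features. In a live recording the mouth opening
# co-varies with the audio energy; with a still-photo replay it does not.
audio_energy = rng.random(100)
live_mouth = 0.8 * audio_energy + 0.2 * rng.random(100)
photo_mouth = np.full(100, 0.5) + 0.01 * rng.random(100)

def av_correlation(audio, visual):
    """Pearson correlation between the audio and visual feature tracks."""
    return float(np.corrcoef(audio, visual)[0, 1])

print(av_correlation(audio_energy, live_mouth))   # high for a live speaker
print(av_correlation(audio_energy, photo_mouth))  # near zero for a still photo
```

Thresholding such a correlation score is the simplest form of synchrony-based liveness confirmation; the cited works replace the raw features with learned ones.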
"Liveness" Verification in Audio-Video Authentication
INTERSPEECH, 2004
This paper proposes to use combined acoustic and visual feature vectors to distinguish live synchronous audio-video recordings from replay attacks that use audio with a still photo. Equal error rates…
Detecting audio-visual synchrony using deep neural networks
INTERSPEECH, 2015
This paper addresses the problem of automatically detecting whether the audio and visual speech modalities in frontal pose videos are synchronous or not, and investigates the use of deep neural networks (DNNs) for this purpose.
Assessing face and speech consistency for monologue detection in video
MULTIMEDIA '02
The most successful and computationally cheapest scheme obtains an accuracy of 82% on the task of picking the "consistent" speaker from a set including three confusers, and a final experiment demonstrates the potential utility of the scheme for speaker localization in video.
Robust audio-visual speech synchrony detection by generalized bimodal linear prediction
INTERSPEECH, 2009
This work builds on earlier work, extending the previously proposed time-evolution model of audio-visual features to include non-causal (future) feature information, which significantly improves the method's robustness to small time-alignment errors between the audio and visual streams.
Audiovisual Speech Synchrony Measure: Application to Biometrics
EURASIP J. Adv. Signal Process., 2007
The paper overviews the most common audio and visual speech front-end processing, transformations performed on audio, visual, or joint audiovisual feature spaces, and the actual measure of correspondence between audio and visual speech.
Look who's talking: speaker detection using video and audio correlation
2000 IEEE International Conference on Multimedia and Expo (ICME 2000)
A method of automatically detecting a talking person from video and audio data with a single microphone is described, based on a time-delayed neural network and a spatio-temporal search for a speaking person.
New Developments in Voice Biometrics for User Authentication
INTERSPEECH, 2011
This work investigates the use of state-of-the-art text-independent and text-dependent speaker verification technology for user authentication and shows how to adapt techniques such as joint factor analysis (JFA), Gaussian mixture models with nuisance attribute projection (GMM-NAP), and hidden Markov models with NAP to obtain improved results for new authentication scenarios and environments.
Spoofing and countermeasures for speaker verification: A survey
Speech Commun., 2015
ASVspoof 2021: Automatic Speaker Verification Spoofing and Countermeasures Challenge Evaluation Plan
ArXiv, 2021
The task is to develop a classifier that separates bona fide from spoofed speech (a spoofing countermeasure); submissions are ranked and analysed, with a summary presented at an INTERSPEECH 2021 satellite workshop.