Text-Dependent Audiovisual Synchrony Detection for Spoofing Detection in Mobile Person Recognition

    Amit Aides, Hagai Aronowitz
Liveness detection is an important countermeasure against spoofing attacks on biometric authentication systems. In the context of audiovisual biometrics, synchrony detection is a proposed method for liveness confirmation. This paper presents a novel, text-dependent scheme for checking audiovisual synchronization in a video sequence. We present custom visual features learned using a unique deep learning framework and show that they outperform other commonly used visual features. We tested our… 


Robust Audiovisual Liveness Detection for Biometric Authentication Using Deep Joint Embedding and Dynamic Time Warping

This work proposes to measure liveness by using dynamic time warping to compare the alignments of audio and video against an a priori recorded sequence, which improves performance over competing methods.
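As a rough illustration of the dynamic time warping step mentioned above, the following minimal sketch computes the classic DTW alignment cost between two scalar sequences. It is not the method of the paper, which operates on learned joint embeddings; the function name `dtw_distance` and the absolute-difference local cost are assumptions made for this example only.

```python
def dtw_distance(a, b):
    """Accumulated DTW cost between two 1-D sequences (illustrative only)."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = minimal accumulated cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])  # assumed local distance
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b repeats
                cost[i][j - 1],      # b advances, a repeats
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


if __name__ == "__main__":
    # A time-stretched copy aligns at zero cost.
    print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))
```

A low cost indicates that one sequence is a time-warped version of the other; a liveness check along these lines would compare that cost against a threshold tuned on held-out data.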

Spoofing detection via simultaneous verification of audio-visual synchronicity and transcription

This work uses coupled hidden Markov models (CHMMs) for text-dependent spoofing detection and introduces new features that capture both the transcription of the utterance and the synchronicity of the two streams, leading to more robust recognition.

Deep Siamese Architecture Based Replay Detection for Secure Voice Biometric

A novel approach detects replayed speech by evaluating the similarity between pairs of speech samples using an embedding learned by deep Siamese architectures. It outperforms state-of-the-art systems on the ASVspoof 2017 challenge corpus without relying on fusion with other sub-systems.

Audiovisual Synchrony Detection with Optimized Audio Features

Deep CCA (DCCA), a nonlinear extension of CCA, is adopted to enhance joint space modeling, yielding substantially improved audiovisual speech synchrony detection with an equal error rate (EER) of 3.68%.
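For readers unfamiliar with the equal error rate quoted above, the sketch below shows one simple way to approximate it from two lists of scores. The function name `equal_error_rate`, the "accept when score >= threshold" convention, and the toy scores are assumptions for illustration; they are not taken from the paper.

```python
def equal_error_rate(genuine, impostor):
    """Approximate EER by sweeping thresholds over the observed scores.

    genuine:  scores for synchronous (live) trials, higher = more genuine
    impostor: scores for asynchronous (spoofed) trials
    """
    thresholds = sorted(set(genuine) | set(impostor))
    best_gap, eer = float("inf"), None
    for t in thresholds:
        # A trial is accepted when its score is at least the threshold.
        far = sum(s >= t for s in impostor) / len(impostor)  # false accepts
        frr = sum(s < t for s in genuine) / len(genuine)     # false rejects
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


if __name__ == "__main__":
    live = [0.9, 0.8, 0.7, 0.4]
    spoof = [0.1, 0.2, 0.3, 0.6]
    print(equal_error_rate(live, spoof))
```

The EER is the operating point where the false acceptance and false rejection rates coincide, so a lower value means better separation between live and spoofed trials.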

Speaker recognition using common passphrases in RedDots

    Hagai Aronowitz
    Computer Science
    2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2017
This paper reports work on RedDots, a recently collected text-dependent speaker recognition dataset, with a focus on the common passphrase condition. It describes the use of bagging for improved accuracy and analyzes system sensitivity to the duration between enrollment and testing (template aging).

Using audio-visual information to understand speaker activity: Tracking active speakers on and off screen

An extremely simple approach to generating (weak) speech clusters can be combined with strong visual signals to effectively associate faces and voices, by aggregating statistics across a video and fusing information from the audio and visual signals.

Audio and visual modality combination in speech processing applications

This chapter focuses on AVASR while also addressing other related problems, namely audio-visual speech activity detection, diarization, and synchrony detection, and covers rapid recent advances leading to so-called "end-to-end" AVASR systems.

Multimodal Transformer Distillation for Audio-Visual Synchronization

An MTDVocaLiST model is proposed, which is trained by the proposed multimodal Transformer distillation (MTD) loss, to deeply mimic the cross-attention distribution and value-relation in the Transformer of VocaLiST.

A speaker independent "liveness" test for audio-visual biometrics

This test ensures that biometric cues being acquired are actual measurements from a live person who is present at the time of capture, and uses the correlation that exists between the lip movements and the speech produced.
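The idea of exploiting the correlation between lip movements and speech can be sketched with a toy lagged-correlation check. The code below is only an assumption-laden illustration, not the test described in the paper: `audio` stands for a hypothetical per-frame audio energy track, `video` for a hypothetical per-frame mouth-opening measure, and `best_lag` is a name invented here.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if degenerate)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def best_lag(audio, video, max_lag):
    """Return (lag, r) maximizing the correlation of audio[t] with video[t + lag]."""
    best = (0, -2.0)
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            a, v = audio[: len(audio) - lag], video[lag:]
        else:
            a, v = audio[-lag:], video[: len(video) + lag]
        n = min(len(a), len(v))
        if n < 2:
            continue  # too few overlapping frames to correlate
        r = pearson(a[:n], v[:n])
        if r > best[1]:
            best = (lag, r)
    return best


if __name__ == "__main__":
    audio = [0, 1, 0, 2, 0, 3, 1, 0, 2, 5]
    delayed_video = [9, 9] + audio[:-2]  # video lags audio by two frames
    print(best_lag(audio, delayed_video, 3))
```

In a live recording the correlation would be expected to peak near zero lag, whereas a weak peak or a large offset hints at a replayed or mismatched stream; real systems use far richer audio and visual features than these scalar tracks.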

"liveness" Verification in Audio-video Authentication

This paper proposes to use combined acoustic and visual feature vectors to distinguish live synchronous audio-video recordings from replay attacks that use audio with a still photo. Equal error rates

Detecting audio-visual synchrony using deep neural networks

This paper addresses the problem of automatically detecting whether the audio and visual speech modalities in frontal pose videos are synchronous or not, and investigates the use of deep neural networks (DNNs) for this purpose.

Assessing face and speech consistency for monologue detection in video

The most successful and computationally cheapest scheme obtains an accuracy of 82% on the task of picking the "consistent" speaker from a set including three confusers, and a final experiment demonstrates the potential utility of the scheme for speaker localization in video.

Robust audio-visual speech synchrony detection by generalized bimodal linear prediction

This work builds on earlier work, extending the previously proposed time-evolution model of audio-visual features to include non-causal (future) feature information, which significantly improves robustness of the method to small time-alignment errors between the audio and visual streams.

Audiovisual Speech Synchrony Measure: Application to Biometrics

This work overviews the most common audio and visual speech front-end processing, the transformations performed on audio, visual, or joint audiovisual feature spaces, and the actual measures of correspondence between audio and visual speech.

Look who's talking: speaker detection using video and audio correlation

    Ross Cutler, L. Davis
    Computer Science
    2000 IEEE International Conference on Multimedia and Expo. ICME2000. Proceedings. Latest Advances in the Fast Changing World of Multimedia (Cat. No.00TH8532)
  • 2000
A method is described for automatically detecting a talking person from video and single-microphone audio data, using a time-delayed neural network and a spatio-temporal search for the speaking person.

New Developments in Voice Biometrics for User Authentication

This work investigates the use of state-of-the-art text-independent and text-dependent speaker verification technology for user authentication and shows how to adapt techniques such as joint factor analysis (JFA), Gaussian mixture models with nuisance attribute projection (GMM-NAP), and hidden Markov models with NAP to obtain improved results for new authentication scenarios and environments.

ASVspoof 2021: Automatic Speaker Verification Spoofing and Countermeasures Challenge Evaluation Plan

The task is to develop a bona fide/spoofed classifier (spoofing countermeasure) for speech data; the results are to be ranked and analysed, and a summary presented at an INTERSPEECH 2021 satellite workshop.