Generalization of Audio Deepfake Detection

Tianxiang Chen, Avrosh Kumar, Parav Nagarsheth, Ganesh Sivaraman, Elie el Khoury
Audio deepfakes, technically known as logical-access voice spoofing attacks, have become an increasing threat to voice interfaces due to recent breakthroughs in speech synthesis and voice conversion technologies. Effectively detecting these attacks is critical to many speech applications, including automatic speaker verification systems. As new types of speech synthesis and voice conversion techniques emerge rapidly, the generalization ability of spoofing countermeasures is becoming…


Pindrop Labs' Submission to the ASVspoof 2021 Challenge

This work focuses on improving the generalization of the embedding extractor model and the backend classifier model, and uses log filter banks as the acoustic features in all the authors' systems.

One-Class Learning Towards Synthetic Voice Spoofing Detection

This work proposes an anti-spoofing system to detect unknown synthetic voice spoofing attacks (i.e., text-to-speech or voice conversion) using one-class learning, achieving an equal error rate (EER) that outperforms all existing single systems.

UR Channel-Robust Synthetic Speech Detection System for ASVspoof 2021

UR-AIR system submission to the logical access (LA) and the speech deepfake (DF) tracks of the ASVspoof 2021 Challenge is presented and a channel-robust synthetic speech detection system is proposed for the challenge.

Audio Deepfake Detection System with Neural Stitching for ADD 2022

This paper describes the best-performing system and methodology for ADD 2022: The First Audio Deep Synthesis Detection Challenge, which outperforms all other systems with a 10.1% equal error rate (EER) in Track 3.2.

One-class learning towards generalized voice spoofing detection

This work proposes an anti-spoofing system to detect unknown logical access attacks (i.e., synthetic speech) using one-class learning and injects an angular margin to separate the spoofing attacks in the embedding space.
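The angular-margin idea can be sketched as a simplified one-class margin loss. This is a hypothetical NumPy illustration in the spirit of that approach, not the paper's exact formulation; the parameter names `m_genuine`, `m_spoof`, and `alpha` are assumptions:

```python
import numpy as np

def one_class_margin_loss(cos_sim, is_spoof, m_genuine=0.9, m_spoof=0.2, alpha=20.0):
    """Softplus-style margin loss on cosine similarity to the genuine-class direction.

    Genuine embeddings are pushed above m_genuine; spoofed embeddings are
    pushed below m_spoof, separating attacks in the embedding space.
    """
    cos_sim = np.asarray(cos_sim, dtype=float)
    is_spoof = np.asarray(is_spoof, dtype=bool)
    # Positive margin -> sample is on the wrong side of its boundary.
    margin = np.where(is_spoof, cos_sim - m_spoof, m_genuine - cos_sim)
    return np.mean(np.log1p(np.exp(alpha * margin)))
```

A well-placed genuine sample (cosine near 1) incurs almost no loss, while a spoofed sample close to the genuine direction is penalized heavily.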

A Study On Data Augmentation In Voice Anti-Spoofing

Anti-Spoofing Using Transfer Learning with Variational Information Bottleneck

Evaluation on the ASVspoof 2019 logical access database shows that the proposed transfer learning scheme, based on a wav2vec 2.0 pretrained model with a variational information bottleneck (VIB) for the speech anti-spoofing task, improves the performance of distinguishing unseen spoofed and genuine speech, outperforming current state-of-the-art anti-spoofing systems.

Synthetic speech detection using meta-learning with prototypical loss

This work addresses the generalizability of spoofing detection by proposing prototypical loss under the meta-learning paradigm to mimic the unseen test scenario during training and demonstrates that the proposed single system without any data augmentation can achieve competitive performance to the recent best anti-spoofing systems on ASVspoof 2019 logical access (LA) task.

Fake Audio Detection in Resource-Constrained Settings Using Microfeatures

This is the first study analysing VOT and coarticulation as features for fake audio detection; it suggests these microfeatures as standalone features for speaker-dependent forensics, voice biometrics, and rapid pre-screening of suspicious audio, and as additional features in larger feature sets for computationally intensive classifiers.

SASV Challenge 2022: A Spoofing Aware Speaker Verification Challenge Evaluation Plan

The ASVspoof challenge series focuses on the development of CMs for a fixed ASV system with a pre-determined operating point, and it is argued that better performance can be delivered when CM and ASV subsystems are both optimised.

Deep Feature Engineering for Noise Robust Spoofing Detection

This paper employs deep feedforward, recurrent, and convolutional neural networks to extract robust and discriminative deep features for spoofing detection, and introduces multicondition training, noise-aware training, and annealed dropout training to make the networks more robust against noise and to avoid overfitting to specific spoofing attacks and noise types.

Long Range Acoustic and Deep Features Perspective on ASVspoof 2019

A comprehensive analysis on the nature of different kinds of spoofing attacks and system development is made and the use of deep features that enhances the discriminative ability between genuine and spoofed speech is investigated.

ASVspoof 2019: Future Horizons in Spoofed and Fake Audio Detection

The 2019 database, protocols and challenge results are described, and major findings which demonstrate the real progress made in protecting against the threat of spoofing and fake audio are outlined.

A Spoofing Benchmark for the 2018 Voice Conversion Challenge: Leveraging from Spoofing Countermeasures for Speech Artifact Assessment

The preliminary findings suggest the potential of CMs outside of their original application, as a supplemental optimization and benchmarking tool to enhance VC technology.

Toward Robust Audio Spoofing Detection: A Detailed Comparison of Traditional and Learned Features

This research examines robust audio features, both traditional and those learned through an autoencoder, which are generalizable to different types of replay spoofing, and bases the system on a traditional Gaussian mixture model-universal background model (GMM-UBM).
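As a rough illustration of GMM-style likelihood-ratio scoring, the sketch below fits one diagonal Gaussian per class (a one-component stand-in for a full GMM-UBM; the class names, feature dimensions, and synthetic data are all hypothetical) and scores an utterance by its average per-frame log-likelihood ratio:

```python
import numpy as np

class DiagGaussian:
    """Diagonal-covariance Gaussian; a one-component stand-in for a GMM."""

    def fit(self, X):
        self.mean = X.mean(axis=0)
        self.var = X.var(axis=0) + 1e-6  # variance floor avoids division by zero
        return self

    def avg_log_likelihood(self, X):
        # Mean per-frame log density under the diagonal Gaussian.
        z = (X - self.mean) ** 2 / self.var
        return (-0.5 * (np.log(2 * np.pi * self.var) + z).sum(axis=1)).mean()

# Toy training data: genuine frames near 0, spoofed frames near 2.
rng = np.random.default_rng(0)
genuine_model = DiagGaussian().fit(rng.normal(0.0, 1.0, size=(500, 4)))
spoof_model = DiagGaussian().fit(rng.normal(2.0, 1.0, size=(500, 4)))

def llr_score(frames):
    """Positive scores favour genuine speech, negative favour spoofed."""
    return genuine_model.avg_log_likelihood(frames) - spoof_model.avg_log_likelihood(frames)
```

A real GMM-UBM system would use many mixture components, MAP adaptation from a universal background model, and acoustic features such as MFCCs or CQCCs rather than raw synthetic vectors.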

SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition

This work presents SpecAugment, a simple data augmentation method for speech recognition that is applied directly to the feature inputs of a neural network (i.e., filter bank coefficients) and achieves state-of-the-art performance on the LibriSpeech 960h and Switchboard 300h tasks, outperforming all prior work.
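The masking operations at the core of SpecAugment can be sketched in a few lines. This is a minimal illustration of one frequency mask and one time mask on a filter-bank spectrogram; the mask-width defaults are illustrative, not the paper's augmentation policy (which also includes time warping and multiple masks):

```python
import numpy as np

def spec_augment(spec, max_freq_mask=8, max_time_mask=10, rng=None):
    """Zero out one random frequency band and one random time span.

    spec: array of shape (freq_bins, frames); assumes mask widths are
    smaller than the corresponding spectrogram dimensions.
    """
    rng = rng or np.random.default_rng()
    spec = spec.copy()  # leave the caller's spectrogram untouched
    n_freq, n_time = spec.shape

    f = rng.integers(0, max_freq_mask + 1)       # mask width, may be 0
    f0 = rng.integers(0, n_freq - f + 1)         # mask start
    spec[f0:f0 + f, :] = 0.0

    t = rng.integers(0, max_time_mask + 1)
    t0 = rng.integers(0, n_time - t + 1)
    spec[:, t0:t0 + t] = 0.0
    return spec
```

Because the masks are drawn fresh for every training example, the network never sees the same corruption twice, which is what drives the regularization effect.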

Detecting Converted Speech and Natural Speech for anti-Spoofing Attack in Speaker Recognition

Experiments show that features derived from the phase spectrum tremendously outperform mel-frequency cepstral coefficients (MFCCs): even without converted speech for training, the equal error rate (EER) is reduced from 20.20% with MFCCs to 2.35%.
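The equal error rate cited throughout these papers is the operating point where the false-acceptance rate (spoofed audio accepted) equals the false-rejection rate (genuine audio rejected). A minimal sketch on hypothetical detection scores, sweeping the threshold over the observed score values:

```python
import numpy as np

def compute_eer(genuine_scores, spoof_scores):
    """EER: find the threshold where false-accept and false-reject rates meet."""
    genuine = np.asarray(genuine_scores, dtype=float)
    spoof = np.asarray(spoof_scores, dtype=float)
    best = (1.0, 0.0)  # (FAR, FRR) with the smallest gap seen so far
    for t in np.sort(np.concatenate([genuine, spoof])):
        far = np.mean(spoof >= t)   # spoofed audio accepted as genuine
        frr = np.mean(genuine < t)  # genuine audio rejected
        if abs(far - frr) < abs(best[0] - best[1]):
            best = (far, frr)
    return (best[0] + best[1]) / 2.0
```

On perfectly separable scores the EER is 0; production toolkits typically interpolate the ROC curve rather than picking the closest observed threshold, but the idea is the same.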

A comparison of features for synthetic speech detection

Comparative results indicate that features representing spectral information in the high-frequency region, dynamic information of speech, and detailed information related to subband characteristics are considerably more useful for the synthetic speech detection task.

The SJTU Robust Anti-Spoofing System for the ASVspoof 2019 Challenge

The SJTU anti-spoofing submission shows consistent performance improvement over all types of spoofing attacks, and log-CQT features are developed in conjunction with multi-layer convolutional neural networks for robust performance across both subtasks.

A study on data augmentation of reverberant speech for robust speech recognition

It is found that the performance gap between using simulated and real RIRs can be eliminated when point-source noises are added, and that the trained acoustic models not only perform well in the distant-talking scenario but also give better results in the close-talking scenario.