Audio-Visual Person-of-Interest DeepFake Detection

Davide Cozzolino, Matthias Nießner, Luisa Verdoliva
Face manipulation technology is advancing rapidly, with new methods proposed almost daily. The aim of this work is to propose a deepfake detector that can cope with the wide variety of manipulation methods and scenarios encountered in the real world. Our key insight is that each person has specific biometric characteristics that a synthetic generator likely cannot reproduce. Accordingly, we extract high-level audio-visual biometric features which characterize the identity of a person… 
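The identity-distance idea behind this detector can be illustrated with a minimal sketch; the embedding extraction step is omitted, and the cosine-distance metric and threshold below are illustrative assumptions, not the paper's actual components:

```python
import math

def cosine_distance(a, b):
    """1 minus the cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb)

def is_fake(test_emb, reference_embs, threshold=0.5):
    """Flag a video as fake if its biometric embedding is far from
    every reference embedding of the claimed identity.
    The 0.5 threshold is a placeholder, not a tuned value."""
    dists = [cosine_distance(test_emb, ref) for ref in reference_embs]
    return min(dists) > threshold
```

In this framing, detection requires only pristine reference material of the person of interest, not examples of each manipulation method, which is what gives identity-based detectors their generality.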

Self-Supervised Video Forensics by Audio-Visual Anomaly Detection

An autoregressive model is trained to generate sequences of audio-visual features, using feature sets that capture the temporal synchronization between video frames and sound; it obtains strong performance on the task of detecting manipulated speech videos.
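The anomaly-detection idea can be sketched as follows; the one-step predictor and scalar features below are hypothetical simplifications (the paper models full audio-visual feature sequences), but the principle is the same: frames the model predicts poorly are suspicious.

```python
def anomaly_score(features, predict_next):
    """Mean one-step prediction error of an autoregressive model over a
    feature sequence; high scores suggest the sequence is anomalous,
    e.g. due to manipulation. `predict_next` maps a history of features
    to a prediction for the next one (a stand-in for the learned model)."""
    errors = []
    for t in range(1, len(features)):
        pred = predict_next(features[:t])
        errors.append(abs(pred - features[t]))
    return sum(errors) / len(errors)
```

A model trained only on real videos assigns low error to genuine sequences, so no fake examples are needed at training time, hence "self-supervised".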



The DeepFake Detection Challenge Dataset

Although deepfake detection is extremely difficult and still an unsolved problem, a deepfake detection model trained only on the DFDC dataset can generalize to real "in-the-wild" deepfake videos, making such a model a valuable tool when analyzing potentially deepfaked videos.

FakeAVCeleb: A Novel Audio-Video Multimodal Deepfake Dataset

A novel audio-video deepfake dataset (FakeAVCeleb) is proposed that contains not only deepfake videos but also the corresponding synthesized cloned audio, along with a novel multimodal detection method that detects fake videos and audio based on this dataset.

Lips Don't Lie: A Generalisable and Robust Approach to Face Forgery Detection

Extensive experiments show that this simple approach significantly surpasses the state of the art in generalisation to unseen manipulations and robustness to perturbations, and shed light on the factors responsible for its performance.

VoxCeleb2: Deep Speaker Recognition

A very large-scale audio-visual speaker recognition dataset collected from open-source media is introduced and Convolutional Neural Network models and training strategies that can effectively recognise identities from voice under various conditions are developed and compared.

ID-Reveal: Identity-aware DeepFake Video Detection

ID-Reveal is introduced, a new approach that learns temporal facial features, specific to how a person moves while talking, by means of metric learning coupled with an adversarial training strategy; it improves generalization and is more robust to the low-quality videos commonly spread over social networks.

Protecting Celebrities from DeepFake with Identity Consistency Transformer

It is shown that Identity Consistency Transformer exhibits superior generalization ability not only across different datasets but also across various types of image degradation forms found in real-world applications including deepfake videos.

Not Made for Each Other: Audio-Visual Dissonance-based Deepfake Detection and Localization

The proposed detection of deepfake videos, based on the dissimilarity between the audio and visual modalities and termed the Modality Dissonance Score (MDS), outperforms the state of the art by up to 7%.
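A minimal sketch of a dissonance-style score, assuming per-segment visual and audio feature vectors have already been extracted; the Euclidean distance and mean aggregation here are illustrative choices, not the paper's exact MDS formulation:

```python
import math

def modality_dissonance_score(video_feats, audio_feats):
    """Average per-segment distance between visual and audio feature
    vectors; larger values indicate audio-visual mismatch, as expected
    when one modality has been manipulated."""
    assert len(video_feats) == len(audio_feats)
    total = 0.0
    for v, a in zip(video_feats, audio_feats):
        total += math.dist(v, a)  # Euclidean distance per segment
    return total / len(video_feats)
```

Because the score is computed per segment before aggregation, the same quantity can also localize which portion of a video is manipulated.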

Disentangled Speech Embeddings Using Cross-Modal Self-Supervision

A self-supervised learning objective is developed that exploits the natural cross-modal synchrony between faces and audio in video to tease apart the representations of linguistic content and speaker identity, without access to manually annotated data.
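The synchrony objective can be sketched as an InfoNCE-style contrastive loss, where the temporally aligned (face, audio) pair is the positive and temporally shifted pairs are negatives; the scalar similarity inputs below are a simplification of the learned similarity function:

```python
import math

def contrastive_sync_loss(sim_pos, sims_neg):
    """InfoNCE-style loss: the synchronized (face, audio) pair should
    score higher than temporally shifted negative pairs. `sim_pos` is
    the similarity of the aligned pair, `sims_neg` the similarities of
    the misaligned ones."""
    denom = math.exp(sim_pos) + sum(math.exp(s) for s in sims_neg)
    return -math.log(math.exp(sim_pos) / denom)
```

Minimizing this loss pushes the aligned pair's similarity above the shifted ones, which is how temporal synchrony alone can supervise the representation.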

FSGAN: Subject Agnostic Face Swapping and Reenactment

A novel recurrent neural network (RNN)-based approach for face reenactment is presented that adjusts for both pose and expression variations and can be applied to a single image or a video sequence; it uses a novel Poisson blending loss that combines Poisson optimization with a perceptual loss.

Fast Face-Swap Using Convolutional Neural Networks

A new loss function is devised that enables the network to produce highly photorealistic results, making face swapping work in real time with no input from the user.