A Multi-View Approach to Audio-Visual Speaker Verification

@inproceedings{sari2021multiview,
  title={A Multi-View Approach to Audio-Visual Speaker Verification},
  author={Leda Sari and Kritika Singh and Jiatong Zhou and Lorenzo Torresani and Nayan Singhal and Yatharth Saraf},
  booktitle={ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year={2021}
}

  • Published 11 February 2021
  • Computer Science
Although speaker verification has conventionally been an audio-only task, some practical applications provide both audio and visual streams of input. In these cases, the visual stream provides complementary information and can often be leveraged in conjunction with the acoustics of speech to improve verification performance. In this study, we explore audio-visual approaches to speaker verification, starting with standard fusion techniques to learn joint audio-visual (AV) embeddings, and then… 
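As a rough illustration of the feature-level fusion the abstract alludes to (not the paper's actual architecture; the embedding dimensions and cosine scoring rule below are assumptions), joint AV embeddings can be formed by normalizing and concatenating per-modality vectors, then scoring verification trials by similarity:

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    """Scale a vector to (approximately) unit length."""
    return x / (np.linalg.norm(x) + eps)

def fuse_embeddings(audio_emb, visual_emb):
    """Feature-level fusion: concatenate per-modality embeddings,
    then renormalize the joint audio-visual vector."""
    return l2_normalize(np.concatenate([l2_normalize(audio_emb),
                                        l2_normalize(visual_emb)]))

def verification_score(enroll, test):
    """Cosine similarity between two fused AV embeddings; a trial is
    accepted if the score exceeds a tuned threshold."""
    return float(np.dot(enroll, test))

# Toy example with random embeddings (dimensions are illustrative).
rng = np.random.default_rng(0)
a1, v1 = rng.normal(size=256), rng.normal(size=128)
same = verification_score(fuse_embeddings(a1, v1), fuse_embeddings(a1, v1))
```

Identical enrollment and test embeddings yield a score near 1.0; in practice the embeddings would come from trained audio and face encoders.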


A Single Self-Supervised Model for Many Speech Modalities Enables Zero-Shot Modality Transfer

While audio-visual speech models can yield superior performance and robustness compared to audio-only models, their development and adoption are hindered by the lack of labeled and unlabeled data.

Learning Lip-Based Audio-Visual Speaker Embeddings with AV-HuBERT

Experimental results suggest that AV-HuBERT generalizes decently to speaker-related downstream tasks, and show that incorporating visual information, even just the lip area, greatly improves performance and noise robustness, reducing EER by 38% in the clean condition and 75% in noisy conditions.

u-HuBERT: Unified Mixed-Modal Speech Pretraining And Zero-Shot Transfer to Unlabeled Modality

While audio-visual speech models can yield superior performance and robustness compared to audio-only models, their development and adoption are hindered by the lack of labeled and unlabeled data.

AVA-AVD: Audio-visual Speaker Diarization in the Wild

The AVA Audio-Visual Relation Network (AVR-Net) introduces a simple yet effective modality mask to capture discriminative information based on face visibility; the method not only outperforms state-of-the-art approaches but also remains more robust as the ratio of off-screen speakers varies.

Learning in Audio-visual Context: A Review, Analysis, and New Perspective

This survey reviews current audio-visual learning from several perspectives, offers an outlook on the field, and aims to give researchers a better understanding of the area.

Self-Supervised Training of Speaker Encoder with Multi-Modal Diverse Positive Pairs

This work proposes a multi-modal contrastive learning technique with novel sampling strategies that outperforms the state-of-the-art self-supervised learning methods by a large margin, and achieves comparable results with the supervised learning counterpart.

Audio-Visual Person-of-Interest DeepFake Detection

This work extracts high-level audio-visual biometric features which characterize the identity of a person, and uses them to create a person-of-interest (POI) deepfake detector that can cope with the wide variety of manipulation methods and scenarios encountered in the real world.


A new model paradigm for acoustic scene classification is introduced by fusing features learned from Mel-spectrograms and the raw waveform in separate feature extraction branches; it is shown that learned features of raw waveforms and Mel-spectrograms are indeed complementary, yielding a consistent classification performance improvement over models trained on Mel-spectrograms alone.

Look longer to see better: Audio-visual event localization by exploiting long-term correlation

An audio-visual long-term correlation network is proposed to capture long-range correlations between audio and visual features, which existing methods underuse; results demonstrate the method's superiority over its counterparts.

Learning Branched Fusion and Orthogonal Projection for Face-Voice Association

This work proposes a lightweight, plug-and-play mechanism that exploits the complementary cues in both modalities to form enriched fused embeddings and clusters them by identity label via orthogonality constraints; the proposed formulation of supervision proves more effective and efficient than those employed by contemporary methods.



Disentangled Speech Embeddings Using Cross-Modal Self-Supervision

A self-supervised learning objective that exploits the natural cross-modal synchrony between faces and audio in video to tease apart the representations of linguistic content and speaker identity without access to manually annotated data is developed.

Multi-level Fusion of Audio and Visual Features for Speaker Identification

A new audio-visual correlative model (AVCM) based on DBN is proposed, which describes both the inter-correlations and loose timing synchronicity between the audio and video streams.

Deep Latent Space Learning for Cross-Modal Mapping of Audio and Visual Signals

A novel deep training algorithm which consists of a single stream network (SSNet) coupled with a novel loss function to learn a shared deep latent space representation of multimodal information and demonstrates the effectiveness of the technique for cross-modal biometric applications.

Noise-tolerant Audio-visual Online Person Verification Using an Attention-based Neural Network Fusion

An attention-based end-to-end neural network that learns multi-sensory association for the task of person verification using both speech and visual signals and demonstrates robustness over other unimodal methods.

Frequency and Temporal Convolutional Attention for Text-Independent Speaker Recognition

  • Sarthak Yadav, A. Rai
  • Computer Science
    ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2020
The proposed CNN front-end fitted with the proposed convolutional attention modules outperforms the no-attention and spatial-CBAM baselines by a significant margin on the VoxCeleb benchmark, indicating that simultaneously modelling temporal and frequency attention translates to better real-world performance.

Audio-Visual Speaker Recognition with a Cross-Modal Discriminative Network

Experiments show that VFNet provides additional speaker discriminative information and achieves 16.54% equal error rate relative reduction over the score level fusion audio-visual baseline on evaluation set of 2019 NIST SRE.
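For context on the metric quoted above: score-level fusion (the baseline mentioned) combines per-modality verification scores, and a "relative reduction" compares a system's equal error rate to the baseline's. A minimal sketch with hypothetical numbers (the weight and EER values below are illustrative, not from the paper):

```python
def fuse_scores(audio_score, visual_score, w=0.5):
    """Score-level fusion: a weighted average of per-modality
    verification scores (the weight w is a design choice)."""
    return w * audio_score + (1.0 - w) * visual_score

def relative_eer_reduction(baseline_eer, system_eer):
    """Relative EER reduction: (baseline - system) / baseline."""
    return (baseline_eer - system_eer) / baseline_eer

# Illustrative: a baseline EER of 10.0% dropping to 8.346%
# corresponds to a 16.54% relative reduction.
reduction = relative_eer_reduction(10.0, 8.346)
```

In practice the fusion weight is tuned on held-out data, and EERs are measured from the score distributions of target and impostor trials.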

Deep Neural Network Embeddings for Text-Independent Speaker Verification

It is found that the embeddings outperform i-vectors for short speech segments and are competitive on long duration test conditions, which are the best results reported for speaker-discriminative neural networks when trained and tested on publicly available corpora.

VoxCeleb: A Large-Scale Speaker Identification Dataset

This paper proposes a fully automated pipeline based on computer vision techniques to create a large-scale text-independent speaker identification dataset collected 'in the wild', and shows that a CNN based architecture obtains the best performance for both identification and verification.

VoxCeleb2: Deep Speaker Recognition

A very large-scale audio-visual speaker recognition dataset collected from open-source media is introduced and Convolutional Neural Network models and training strategies that can effectively recognise identities from voice under various conditions are developed and compared.

Training Spoken Language Understanding Systems with Non-Parallel Speech and Text

This study investigates the use of non-parallel speech and text to improve the performance of dialog act recognition as an example SLU task and proposes a multiview architecture that can handle each modality separately.