A Multi-View Approach to Audio-Visual Speaker Verification
```bibtex
@article{Sari2021AMA,
  title   = {A Multi-View Approach to Audio-Visual Speaker Verification},
  author  = {Leda Sari and Kritika Singh and Jiatong Zhou and Lorenzo Torresani and Nayan Singhal and Yatharth Saraf},
  journal = {ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year    = {2021},
  pages   = {6194-6198}
}
```
Although speaker verification has conventionally been an audio-only task, some practical applications provide both audio and visual streams of input. In these cases, the visual stream provides complementary information and can often be leveraged in conjunction with the acoustics of speech to improve verification performance. In this study, we explore audio-visual approaches to speaker verification, starting with standard fusion techniques to learn joint audio-visual (AV) embeddings, and then…
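The standard fusion techniques the abstract mentions can be pictured with a minimal sketch: concatenate per-modality embeddings and project them into a joint audio-visual space scored by cosine similarity. All module names, dimensions, and layer sizes below are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ConcatFusion(nn.Module):
    """Concatenate audio and visual embeddings, then project to a joint
    embedding used for verification trials (illustrative sketch)."""
    def __init__(self, audio_dim=512, visual_dim=512, joint_dim=256):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(audio_dim + visual_dim, joint_dim),
            nn.ReLU(),
            nn.Linear(joint_dim, joint_dim),
        )

    def forward(self, audio_emb, visual_emb):
        # audio_emb: (batch, audio_dim), visual_emb: (batch, visual_dim)
        fused = torch.cat([audio_emb, visual_emb], dim=-1)
        # L2-normalize so cosine scoring works directly on the output
        return nn.functional.normalize(self.proj(fused), dim=-1)

fusion = ConcatFusion()
a, v = torch.randn(8, 512), torch.randn(8, 512)
print(fusion(a, v).shape)  # torch.Size([8, 256])
```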
13 Citations
A Single Self-Supervised Model for Many Speech Modalities Enables Zero-Shot Modality Transfer
- Computer Science, ArXiv
- 2022
While audio-visual speech models can yield superior performance and robustness compared to audio-only models, their development and adoption are hindered by the lack of labeled and unlabeled…
Learning Lip-Based Audio-Visual Speaker Embeddings with AV-HuBERT
- Computer Science, INTERSPEECH
- 2022
Experimental results suggest that AV-HuBERT generalizes decently to speaker related downstream tasks, and shows that incorporating visual information, even just the lip area, greatly improves the performance and noise robustness, reducing EER by 38% in the clean condition and 75% in noisy conditions.
u-HuBERT: Unified Mixed-Modal Speech Pretraining And Zero-Shot Transfer to Unlabeled Modality
- Computer Science
- 2022
While audio-visual speech models can yield superior performance and robustness compared to audio-only models, their development and adoption are hindered by the lack of labeled and unlabeled…
AVA-AVD: Audio-visual Speaker Diarization in the Wild
- Computer Science, ACM Multimedia
- 2022
The AVA Audio-Visual Relation Network (AVR-Net) is designed, which introduces a simple yet effective modality mask to capture discriminative information based on face visibility, and shows that the method not only outperforms state-of-the-art methods but is also more robust as the ratio of off-screen speakers varies.
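The modality mask idea summarized above can be sketched roughly as gating the visual branch by face visibility, so scoring falls back to audio-only evidence when no face is on screen. Names and shapes below are assumed for illustration, not taken from the paper.

```python
import torch
import torch.nn as nn

class ModalityMaskFusion(nn.Module):
    """Zero out the visual branch when no face is visible, so the fused
    representation degrades gracefully to audio-only evidence (sketch)."""
    def __init__(self, dim=256):
        super().__init__()
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, audio_emb, visual_emb, face_visible):
        # face_visible: (batch,) float tensor in {0.0, 1.0}
        masked_visual = visual_emb * face_visible.unsqueeze(-1)
        return self.proj(torch.cat([audio_emb, masked_visual], dim=-1))
```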
Learning in Audio-visual Context: A Review, Analysis, and New Perspective
- Computer Science, ArXiv
- 2022
This survey reviews and outlooks the current audio-visual learning from different aspects and hopes it can provide researchers with a better understanding of this area.
Self-Supervised Training of Speaker Encoder with Multi-Modal Diverse Positive Pairs
- Computer Science, ArXiv
- 2022
This work proposes a multi-modal contrastive learning technique with novel sampling strategies that outperforms the state-of-the-art self-supervised learning methods by a large margin, and achieves comparable results with the supervised learning counterpart.
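As a rough illustration of multi-modal contrastive learning with cross-modal positive pairs, here is a minimal symmetric InfoNCE sketch. The function name and temperature are assumptions, and the paper's diverse positive-pair sampling strategies are not reproduced.

```python
import torch
import torch.nn.functional as F

def cross_modal_info_nce(audio_emb, face_emb, temperature=0.07):
    """Treat the (audio, face) pair from the same identity as the positive
    and all other faces in the batch as negatives (symmetric InfoNCE)."""
    a = F.normalize(audio_emb, dim=-1)
    f = F.normalize(face_emb, dim=-1)
    logits = a @ f.t() / temperature            # (batch, batch) similarities
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))
```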
Audio-Visual Person-of-Interest DeepFake Detection
- Computer Science, ArXiv
- 2022
This work extracts high-level audio-visual biometric features which characterize the identity of a person, and uses them to create a person-of-interest (POI) deepfake detector that can cope with the wide variety of manipulation methods and scenarios encountered in the real world.
Investigating Waveform and Spectrogram Feature Fusion for Acoustic Scene Classification (Technical Report)
- Computer Science
- 2021
A new model paradigm for acoustic scene classification is introduced by fusing features learned from Mel-spectrograms and the raw waveform in separate feature extraction branches, and it is shown that learned features of raw waveforms and Mel-spectrograms are indeed complementary and yield a consistent classification performance improvement over models trained on Mel-spectrograms alone.
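A minimal sketch of this two-branch fusion idea, assuming a 1-D convolutional branch for the raw waveform and a 2-D convolutional branch for the Mel-spectrogram; the layer sizes are placeholders, not the report's architecture.

```python
import torch
import torch.nn as nn

class TwoBranchASC(nn.Module):
    """Separate feature extractors for the raw waveform and the
    Mel-spectrogram, fused before classification (sketch)."""
    def __init__(self, n_classes=10):
        super().__init__()
        self.wave_branch = nn.Sequential(  # 1-D convs on the waveform
            nn.Conv1d(1, 32, kernel_size=64, stride=16), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),
        )
        self.spec_branch = nn.Sequential(  # 2-D convs on the spectrogram
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.classifier = nn.Linear(32 + 32, n_classes)

    def forward(self, waveform, mel_spec):
        # waveform: (batch, 1, samples); mel_spec: (batch, 1, mels, frames)
        feats = torch.cat([self.wave_branch(waveform),
                           self.spec_branch(mel_spec)], dim=-1)
        return self.classifier(feats)
```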
Look longer to see better: Audio-visual event localization by exploiting long-term correlation
- Computer Science, 2022 International Joint Conference on Neural Networks (IJCNN)
- 2022
An audio-visual long-term correlation network is proposed to capture long-range correlations between audio and visual features, which existing methods underuse; the results demonstrate the superiority of the method over its counterparts.
Learning Branched Fusion and Orthogonal Projection for Face-Voice Association
- Computer Science, ArXiv
- 2022
This work proposes a light-weight, plug-and-play mechanism that exploits the complementary cues in both modalities to form enriched fused embeddings, clusters them by identity label via orthogonality constraints, and reveals that the proposed formulation of supervision is more effective and efficient than those employed by contemporary methods.
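The orthogonality constraint mentioned above can be approximated by a loss that pulls same-identity embeddings together while pushing different-identity embeddings toward zero cosine similarity; the sketch below is in that spirit, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def orthogonal_projection_loss(embeddings, labels):
    """Encourage same-identity embeddings to align (cosine -> 1) and
    different-identity embeddings to be orthogonal (cosine -> 0).
    Assumes each identity appears more than once in the batch."""
    z = F.normalize(embeddings, dim=-1)
    cos = z @ z.t()                                   # pairwise cosines
    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    eye = torch.eye(len(labels), dtype=torch.bool, device=z.device)
    pos = cos[same & ~eye]                            # same id, off-diagonal
    neg = cos[~same]                                  # different identities
    return (1.0 - pos).mean() + neg.abs().mean()
```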
References
Showing 1-10 of 25 references
Disentangled Speech Embeddings Using Cross-Modal Self-Supervision
- Computer Science, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2020
A self-supervised learning objective that exploits the natural cross-modal synchrony between faces and audio in video to tease apart the representations of linguistic content and speaker identity without access to manually annotated data is developed.
Multi-level Fusion of Audio and Visual Features for Speaker Identification
- Physics, ICB
- 2006
A new audio-visual correlative model (AVCM) based on dynamic Bayesian networks (DBN) is proposed, which describes both the inter-correlations and the loose timing synchronicity between the audio and video streams.
Deep Latent Space Learning for Cross-Modal Mapping of Audio and Visual Signals
- Computer Science, 2019 Digital Image Computing: Techniques and Applications (DICTA)
- 2019
A novel deep training algorithm is proposed, consisting of a single-stream network (SSNet) coupled with a novel loss function to learn a shared deep latent space representation of multimodal information; the technique's effectiveness is demonstrated for cross-modal biometric applications.
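The shared-latent-space idea can be sketched as a single projection network, with shared weights, applied to both modalities' features so that matched face and voice pairs land close together; dimensions and names below are assumed.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedLatentProjector(nn.Module):
    """One projection network (shared weights) maps pre-extracted face and
    voice features into a common latent space for cross-modal matching."""
    def __init__(self, in_dim=512, latent_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),
        )

    def forward(self, feats):
        return F.normalize(self.net(feats), dim=-1)

proj = SharedLatentProjector()
face, voice = torch.randn(4, 512), torch.randn(4, 512)
# Cosine similarity between projected face and voice features:
print((proj(face) * proj(voice)).sum(dim=-1))
```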
Noise-tolerant Audio-visual Online Person Verification Using an Attention-based Neural Network Fusion
- Computer Science, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2019
An attention-based end-to-end neural network is presented that learns multi-sensory associations for the task of person verification using both speech and visual signals, and demonstrates robustness over unimodal methods.
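The attention-based fusion idea can be sketched as learning a soft weight per modality so that a noisy stream is down-weighted at test time; the module below is an assumed simplification, not the paper's network.

```python
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """Compute a soft attention weight per modality from the two
    embeddings, so a noisy stream can be down-weighted (sketch)."""
    def __init__(self, dim=256):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, audio_emb, visual_emb):
        stacked = torch.stack([audio_emb, visual_emb], dim=1)  # (B, 2, D)
        weights = torch.softmax(self.score(stacked), dim=1)    # (B, 2, 1)
        return (weights * stacked).sum(dim=1)                  # (B, D)
```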
Frequency and Temporal Convolutional Attention for Text-Independent Speaker Recognition
- Computer Science, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2020
A CNN front-end fitted with the proposed convolutional attention modules outperforms the no-attention and spatial-CBAM baselines by a significant margin on the VoxCeleb benchmark, concluding that simultaneously modelling temporal and frequency attention translates to better real-world performance.
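A rough sketch of separate frequency and temporal attention over a CNN feature map, in the spirit of the modules described above; the pooling-plus-1x1-convolution gating is an assumption, not the paper's exact design.

```python
import torch
import torch.nn as nn

class TimeFreqAttention(nn.Module):
    """Apply attention separately along the frequency and time axes of a
    (batch, channels, freq, time) feature map (illustrative sketch)."""
    def __init__(self, channels=32):
        super().__init__()
        self.freq_att = nn.Conv2d(channels, channels, kernel_size=1)
        self.time_att = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):
        # x: (B, C, F, T)
        f = torch.sigmoid(self.freq_att(x.mean(dim=3, keepdim=True)))  # (B,C,F,1)
        t = torch.sigmoid(self.time_att(x.mean(dim=2, keepdim=True)))  # (B,C,1,T)
        return x * f * t
```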
Audio-Visual Speaker Recognition with a Cross-Modal Discriminative Network
- Computer Science, INTERSPEECH
- 2020
Experiments show that VFNet provides additional speaker-discriminative information and achieves a 16.54% relative reduction in equal error rate over the score-level fusion audio-visual baseline on the evaluation set of the 2019 NIST SRE.
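Score-level fusion, the baseline mentioned above, simply combines per-trial verification scores from independent audio and visual systems; a minimal sketch with an assumed linear weighting follows.

```python
import numpy as np

def score_level_fusion(audio_score, visual_score, alpha=0.5):
    """Fuse per-trial verification scores from independent audio and
    visual systems by weighted averaging; alpha is tuned on dev data."""
    return alpha * np.asarray(audio_score) + (1 - alpha) * np.asarray(visual_score)

# Example trial scores (e.g., cosine similarities), purely illustrative:
print(score_level_fusion([0.62, -0.10], [0.48, 0.05], alpha=0.6))
```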
Deep Neural Network Embeddings for Text-Independent Speaker Verification
- Computer Science, INTERSPEECH
- 2017
It is found that the embeddings outperform i-vectors for short speech segments and are competitive on long-duration test conditions; these are the best results reported for speaker-discriminative neural networks trained and tested on publicly available corpora.
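The x-vector style embeddings referenced here aggregate frame-level features with statistics pooling (mean and standard deviation over time); below is a minimal sketch of that pooling step.

```python
import torch
import torch.nn as nn

class StatsPooling(nn.Module):
    """Mean + standard-deviation pooling over frames, the aggregation
    step used in x-vector style speaker embeddings."""
    def forward(self, frames):
        # frames: (batch, feat_dim, num_frames)
        mean = frames.mean(dim=-1)
        std = frames.std(dim=-1)
        return torch.cat([mean, std], dim=-1)  # (batch, 2 * feat_dim)

pool = StatsPooling()
print(pool(torch.randn(4, 512, 300)).shape)  # torch.Size([4, 1024])
```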
VoxCeleb: A Large-Scale Speaker Identification Dataset
- Computer Science, INTERSPEECH
- 2017
This paper proposes a fully automated pipeline based on computer vision techniques to create a large-scale text-independent speaker identification dataset collected 'in the wild', and shows that a CNN-based architecture obtains the best performance for both identification and verification.
VoxCeleb2: Deep Speaker Recognition
- Computer Science, INTERSPEECH
- 2018
A very large-scale audio-visual speaker recognition dataset collected from open-source media is introduced, and convolutional neural network models and training strategies that can effectively recognise identities from voice under various conditions are developed and compared.
Training Spoken Language Understanding Systems with Non-Parallel Speech and Text
- Computer Science, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2020
This study investigates the use of non-parallel speech and text to improve the performance of dialog act recognition as an example SLU task, and proposes a multi-view architecture that can handle each modality separately.