Deep Latent Space Learning for Cross-Modal Mapping of Audio and Visual Signals

@article{Nawaz2019DeepLS,
  title={Deep Latent Space Learning for Cross-Modal Mapping of Audio and Visual Signals},
  author={Shah Nawaz and Muhammad Kamran Janjua and Ignazio Gallo and A. Mahmood and Alessandro Calefati},
  journal={2019 Digital Image Computing: Techniques and Applications (DICTA)},
  year={2019},
  pages={1-7}
}
We propose a novel deep training algorithm for joint representation of audio and visual information, consisting of a single stream network (SSNet) coupled with a novel loss function to learn a shared deep latent space representation of multimodal information. The proposed framework characterizes the shared latent space by leveraging class centers, which helps eliminate the need for pairwise or triplet supervision. We quantitatively and qualitatively evaluate the proposed approach on…
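The abstract above only sketches the idea; below is a minimal PyTorch illustration of the concept, assuming pre-extracted audio and face feature vectors of the same dimensionality are pushed through one shared network whose embeddings are pulled toward per-identity class centers. Layer sizes, dimensions, and the exact loss form are illustrative assumptions, not the authors' released code.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SingleStreamNet(nn.Module):
    """One shared mapping applied to BOTH modalities (the 'single stream')."""
    def __init__(self, in_dim=512, emb_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(),
            nn.Linear(256, emb_dim),
        )

    def forward(self, x):                        # x: audio OR face features
        return F.normalize(self.net(x), dim=1)   # point in the shared latent space

class CenterLoss(nn.Module):
    """Pull every embedding toward its identity's center; no pair/triplet sampling."""
    def __init__(self, num_classes, emb_dim=128):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_classes, emb_dim))

    def forward(self, emb, labels):
        # squared distance between each embedding and its class center
        return ((emb - self.centers[labels]) ** 2).sum(dim=1).mean()

# usage: both modalities go through the SAME network and share the SAME centers
model, center_loss = SingleStreamNet(), CenterLoss(num_classes=1000)
audio_emb = model(torch.randn(8, 512))   # batch of audio features
face_emb  = model(torch.randn(8, 512))   # batch of face features
labels    = torch.randint(0, 1000, (8,))
loss = center_loss(audio_emb, labels) + center_loss(face_emb, labels)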

Citations

Adversarial-Metric Learning for Audio-Visual Cross-Modal Matching
TLDR
A novel Adversarial-Metric Learning (AML) model for audio-visual matching that generates a modality-independent representation for each person in each modality via adversarial learning, while simultaneously learning a robust similarity measure for cross-modality matching via metric learning.
A Multi-View Approach to Audio-Visual Speaker Verification
TLDR
This study investigates unimodal and concatenation-based AV fusion and reports the lowest AV equal error rate (EER) of 0.7% on the VoxCeleb1 dataset using the best system, and introduces a multi-view model which uses a shared classifier to map audio and video into the same space.
Fusion and Orthogonal Projection for Improved Face-Voice Association
TLDR
This work hypothesizes that an enriched feature representation coupled with effective yet efficient supervision is necessary to realize a discriminative joint embedding space for improved face-voice association, and proposes a light-weight, plug-and-play mechanism that exploits the complementary cues in both modalities to form enriched fused embeddings and clusters them based on their identity labels via orthogonality constraints.
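As a rough illustration of clustering embeddings by identity via orthogonality constraints, the sketch below pulls same-identity embeddings toward cosine similarity 1 and pushes different-identity embeddings toward similarity 0; it is a generic formulation, not necessarily the exact loss used in that paper.

import torch
import torch.nn.functional as F

def orthogonality_loss(emb, labels):
    """emb: (B, D) fused embeddings; labels: (B,) identity ids.
    Same identity -> align; different identities -> (near-)orthogonal."""
    emb = F.normalize(emb, dim=1)
    sim = emb @ emb.t()                                  # pairwise cosine similarities
    same = labels.unsqueeze(0) == labels.unsqueeze(1)    # same-identity mask
    pull = (1 - sim[same]).mean()                        # drive matching pairs to similarity 1
    push = sim[~same].abs().mean()                       # drive mismatched pairs to similarity 0
    return pull + push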
Audio-Visual Speaker Recognition with a Cross-Modal Discriminative Network
TLDR
Experiments show that VFNet provides additional speaker-discriminative information and achieves a 16.54% relative reduction in equal error rate over the score-level fusion audio-visual baseline on the evaluation set of the 2019 NIST SRE.
Cross-modal Speaker Verification and Recognition: A Multilingual Perspective
  • M. S. Saeed, S. Nawaz, A. D. Bue · 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2021
TLDR
A challenging task of establishing the association between faces and voices across multiple languages spoken by the same set of persons is introduced, to answer two closely related questions: is face-voice association language independent, and can a speaker be recognised irrespective of the spoken language?

References

Showing 1-10 of 32 references
3D Convolutional Neural Networks for Cross Audio-Visual Matching Recognition
TLDR
This paper proposes the use of a coupled 3D convolutional neural network (3D CNN) architecture that can map both modalities into a representation space to evaluate the correspondence of audio–visual streams using the learned multimodal features.
Cross-Modal Scene Networks
TLDR
The experiments suggest that the scene representation can help transfer representations across modalities for retrieval, and the visualizations suggest that units emerge in the shared representation that tend to activate on consistent concepts independently of the modality.
Disjoint Mapping Network for Cross-modal Matching of Voices and Faces
TLDR
It is shown empirically that DIMNet is able to achieve better performance than other current methods, with the additional benefits of being conceptually simpler and less data-intensive.
Learnable PINs: Cross-Modal Embeddings for Person Identity
TLDR
A curriculum learning schedule for hard negative mining targeted to this task, which is essential for learning to proceed successfully, is developed, and an application of the joint embedding to automatically retrieving and labelling characters in TV dramas is shown.
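A generic sketch of curriculum-style hard negative mining for a face-voice contrastive loss is given below; the fraction of closest (hardest) negatives kept per anchor is the curriculum knob, while the margin, schedule, and loss form are assumptions rather than the paper's exact recipe.

import torch
import torch.nn.functional as F

def contrastive_with_hard_negatives(face_emb, voice_emb, labels, tau, margin=0.6):
    """face_emb/voice_emb: (B, D) L2-normalised; labels: (B,) identity ids.
    tau in (0, 1] controls how hard the mined negatives are
    (curriculum: start large, shrink over epochs)."""
    dist = torch.cdist(face_emb, voice_emb)                # (B, B) pairwise distances
    pos_mask = labels.unsqueeze(0) == labels.unsqueeze(1)  # same-identity (positive) pairs
    pos_loss = dist[pos_mask].pow(2).mean()                # pull matching pairs together

    neg_dist = dist.masked_fill(pos_mask, float('inf'))    # keep only mismatched pairs
    k = max(1, int(tau * (neg_dist.size(1) - 1)))          # curriculum: fraction of negatives kept
    hardest, _ = neg_dist.topk(k, dim=1, largest=False)    # k closest (hardest) negatives per anchor
    neg_loss = F.relu(margin - hardest).pow(2).mean()      # push them beyond the margin
    return pos_loss + neg_loss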
Git Loss for Deep Face Recognition
TLDR
This work introduces a joint supervision signal, Git loss, which leverages softmax and center loss functions to enhance the discriminative capability of deep features in CNNs and achieves state-of-the-art accuracy on two major face recognition benchmark datasets.
A Discriminative Feature Learning Approach for Deep Face Recognition
TLDR
This paper proposes a new supervision signal, called center loss, for the face recognition task, which simultaneously learns a center for the deep features of each class and penalizes the distances between the deep features and their corresponding class centers.
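As defined in that paper, the center loss penalizes the squared distance between each deep feature and its class center,
    L_C = (1/2) · Σ_{i=1}^{m} || x_i − c_{y_i} ||_2^2,
where x_i is the deep feature of sample i and c_{y_i} is the learned center of its class y_i; it is used jointly with the softmax loss as L = L_S + λ·L_C, with λ balancing the two terms.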
Deep Neural Network Embeddings for Text-Independent Speaker Verification
TLDR
It is found that the embeddings outperform i-vectors for short speech segments and are competitive on long-duration test conditions, which are the best results reported for speaker-discriminative neural networks when trained and tested on publicly available corpora.
Seeing Voices and Hearing Faces: Cross-Modal Biometric Matching
TLDR
This paper introduces a seemingly impossible task: given only an audio clip of someone speaking, decide which of two face images is the speaker. It shows that a CNN can indeed be trained to solve this task in both the static and dynamic scenarios, and is even well above chance on 10-way classification of the face given the voice.
SoundNet: Learning Sound Representations from Unlabeled Video
TLDR
This work proposes a student-teacher training procedure which transfers discriminative visual knowledge from well-established visual recognition models into the sound modality using unlabeled video as a bridge, and suggests that some high-level semantics automatically emerge in the sound network, even though it is trained without ground-truth labels.
Look, Listen and Learn
TLDR
There is a valuable, but so far untapped, source of information contained in the video itself, namely the correspondence between the visual and audio streams; a novel “Audio-Visual Correspondence” learning task is introduced to make use of it.