Deep Latent Space Learning for Cross-Modal Mapping of Audio and Visual Signals

  • Shah Nawaz, Muhammad Kamran Janjua, I. Gallo, A. Mahmood, Alessandro Calefati
  • Published 2019
  • Computer Science, Engineering
  • 2019 Digital Image Computing: Techniques and Applications (DICTA)
  • We propose a novel deep training algorithm for joint representation of audio and visual information. It consists of a single stream network (SSNet) coupled with a novel loss function that learns a shared deep latent space representation of multimodal information. The proposed framework characterizes the shared latent space using class centers, which helps eliminate the need for pairwise or triplet supervision. We quantitatively and qualitatively evaluate the proposed approach on…
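
The abstract does not give the loss function itself, so the following is only a minimal sketch of the general idea of center-based supervision, assuming a generic center-loss formulation rather than the authors' actual loss. The function names `center_loss` and `update_centers`, the step size `alpha`, and the toy embeddings are all hypothetical. Face and voice embeddings of the same identity are pulled toward one shared class center, so no pairs or triplets need to be constructed.

```python
def center_loss(embeddings, labels, centers):
    """Half the mean squared distance from each embedding to its class center."""
    total = 0.0
    for x, y in zip(embeddings, labels):
        total += sum((xi - ci) ** 2 for xi, ci in zip(x, centers[y]))
    return 0.5 * total / len(embeddings)


def update_centers(embeddings, labels, centers, alpha=0.5):
    """Move each class center a fraction alpha toward the mean of its members."""
    updated = [list(c) for c in centers]
    for k in set(labels):
        members = [x for x, y in zip(embeddings, labels) if y == k]
        mean = [sum(col) / len(members) for col in zip(*members)]
        updated[k] = [c + alpha * (m - c) for c, m in zip(centers[k], mean)]
    return updated


# Toy batch (assumed values): two identities, each with one "face" and one
# "voice" embedding that have already passed through the same network.
face = [[1.0, 0.0], [0.0, 1.0]]
voice = [[0.8, 0.2], [0.2, 0.8]]
embeddings = face + voice
labels = [0, 1, 0, 1]          # identity label per embedding, shared across modalities
centers = [[0.0, 0.0], [0.0, 0.0]]

loss_before = center_loss(embeddings, labels, centers)
centers = update_centers(embeddings, labels, centers, alpha=1.0)
loss_after = center_loss(embeddings, labels, centers)
print(loss_before, loss_after)
```

Because both modalities index into the same `centers` table by identity label, minimizing this term draws a person's face and voice embeddings together without ever comparing them to each other directly. In practice such a term is usually combined with a classification loss so that different identities stay separated.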