Corpus ID: 236428913

Facetron: Multi-speaker Face-to-Speech Model based on Cross-modal Latent Representations

  • Se-Yun Um, Jihyun Kim, Jihyun Lee, Sangshin Oh, Kyungguen Byun, Hong-Goo Kang
  • Published 2021
  • Computer Science, Engineering
  • ArXiv
In this paper, we propose an effective method to synthesize speaker-specific speech waveforms by conditioning on videos of an individual’s face. Using a generative adversarial network (GAN) with linguistic and speaker characteristic features as auxiliary conditions, our method directly converts face images into speech waveforms under an end-to-end training framework. The linguistic features are extracted from lip movements using a lip-reading model, and the speaker characteristic features are…

Face2Speech: Towards Multi-Speaker Text-to-Speech Synthesis Using an Embedding Vector Predicted from a Face Image
Experimental results of matching and naturalness tests demonstrate that synthetic speech generated with the face-derived embedding vector is comparable to that generated with the speech-derived embedding vector.
Face Landmark-based Speaker-independent Audio-visual Speech Enhancement in Multi-talker Environments
The proposed models are the first trained and evaluated on the limited-size GRID and TCD-TIMIT datasets to achieve speaker-independent speech enhancement in a multi-talker setting.
Multimodal Target Speech Separation with Voice and Face References
The experimental results show that a pre-enrolled face image benefits the separation of the target speech signal, and that further improvement can be achieved by combining face and voice embeddings.
Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis
It is shown that randomly sampled speaker embeddings can be used to synthesize speech in the voice of novel speakers dissimilar from those used in training, indicating that the model has learned a high quality speaker representation.
Look, Listen and Learn - A Multimodal LSTM for Speaker Identification
A novel multimodal Long Short-Term Memory (LSTM) architecture is described that seamlessly unifies the visual and auditory modalities from the beginning of each input sequence; it outperforms state-of-the-art systems in speaker identification, with a lower false alarm rate and higher recognition accuracy.
Face-Voice Matching using Cross-modal Embeddings
A face-voice matching model that learns cross-modal embeddings between face images and voice characteristics is proposed; it achieves results very similar to human performance reported in cognitive science studies.
Speech2Face: Learning the Face Behind a Voice
This paper designs and trains a deep neural network to reconstruct a facial image of a person from a short audio recording of that person speaking, and evaluates and numerically quantifies how closely these Speech2Face reconstructions, obtained directly from audio, resemble the true face images of the speakers.
Multi-speaker modeling and speaker adaptation for DNN-based TTS synthesis
This paper proposes an approach to multi-speaker TTS modeling with a general DNN, in which the same hidden layers are shared among different speakers while the output layers are composed of speaker-dependent nodes explaining the target of each speaker.
Seeing Voices and Hearing Faces: Cross-Modal Biometric Matching
This paper introduces a seemingly impossible task: given only an audio clip of someone speaking, decide which of two face images is the speaker. It shows that a CNN can indeed be trained to solve this task in both the static and dynamic scenarios, and is even well above chance on 10-way classification of the face given the voice.
Looking into Your Speech: Learning Cross-modal Affinity for Audio-visual Speech Separation
A cross-modal affinity network (CaffNet) is proposed that learns global correspondence as well as locally varying affinities between audio and visual streams, in order to overcome the frame discontinuity problem between the two modalities caused by transmission delay mismatch or jitter.