Audio-to-Visual Speech Conversion Using Deep Neural Networks

Sarah L. Taylor, Akihiro Kato, I. Matthews, Ben P. Milner
We study the problem of mapping from acoustic to visual speech with the goal of generating accurate, perceptually natural speech animation automatically from an audio speech signal. We present a sliding window deep neural network that learns a mapping from a window of acoustic features to a window of visual features from a large audio-visual speech dataset. Overlapping visual predictions are averaged to generate continuous, smoothly varying speech animation. We outperform a baseline HMM… 
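The sliding-window scheme described in the abstract — predicting a short window of visual features for each audio position, then averaging the overlapping predictions into one smoothly varying trajectory — can be sketched as follows. This is a minimal NumPy sketch; the function name, the `stride` parameter, and the array shapes are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def overlap_average(windows, stride=1):
    """Average overlapping predicted windows into one continuous sequence.

    windows: array of shape (num_windows, win_len, feat_dim) — one window
    of predicted visual features per input audio position.
    Returns an array of shape ((num_windows - 1) * stride + win_len, feat_dim).
    """
    num_win, win_len, feat_dim = windows.shape
    total = (num_win - 1) * stride + win_len
    out = np.zeros((total, feat_dim))
    counts = np.zeros((total, 1))
    for i in range(num_win):
        start = i * stride
        out[start:start + win_len] += windows[i]   # accumulate each prediction
        counts[start:start + win_len] += 1          # how many windows cover each frame
    return out / counts                             # per-frame average
```

Because every output frame is an average over several independent window predictions, small frame-to-frame prediction noise is smoothed out, which is what yields the continuous animation the abstract describes.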

Speaker-Independent Speech-Driven Visual Speech Synthesis using Domain-Adapted Acoustic Models

This paper investigates adapting an automatic speech recognition (ASR) acoustic model (AM) for the visual speech synthesis problem and concludes that visual speech synthesis can significantly benefit from the powerful representation of speech in the ASR acoustic models.

Audiovisual Speech Synthesis using Tacotron2

The end-to-end AVTacotron2 system is able to synthesize close to human-like audiovisual speech with mean opinion scores (MOS) of 4.1, which is the same MOS obtained on the ground truth generated from professionally recorded videos.

Modality Dropout for Improved Performance-driven Talking Faces

This work uses subjective testing to demonstrate the improvement of audiovisual-driven animation over the equivalent video-only approach, and the improvement in the animation of speech-related facial movements after introducing modality dropout.
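The modality-dropout idea summarized above — randomly silencing one input stream during training so the model cannot over-rely on either audio or video — can be sketched as a simple preprocessing step. This is a hypothetical illustration; the function name and drop probability are assumptions, not the cited paper's implementation.

```python
import numpy as np

def modality_dropout(audio_feats, video_feats, p_drop=0.3, rng=None):
    """Randomly zero out one input modality during training.

    With probability p_drop, one of the two feature streams (chosen
    uniformly) is replaced with zeros, forcing the downstream model to
    learn to animate from whichever modality remains available.
    """
    if rng is None:
        rng = np.random.default_rng()
    if rng.random() < p_drop:
        if rng.random() < 0.5:
            audio_feats = np.zeros_like(audio_feats)  # drop the audio stream
        else:
            video_feats = np.zeros_like(video_feats)  # drop the video stream
    return audio_feats, video_feats
```

At inference time the step is simply skipped, so the model sees both streams when they are available and degrades gracefully when one is missing.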

Dense Convolutional Recurrent Neural Network for Generalized Speech Animation

  • Lei Xiao, Zengfu Wang
  • Computer Science
    2018 24th International Conference on Pattern Recognition (ICPR)
  • 2018
The approach learns a non-linear mapping from acoustic speech to multiple articulator movements in a unified framework to which feature extraction, context encoding and multi-parameter decoding are integrated, and has the capability of deploying on various character models.

Semi-supervised Cross-domain Visual Feature Learning for Audio-Visual Broadcast Speech Transcription

Experimental results suggest that a CNN-based AVSR system using the proposed semi-supervised cross-domain audio-to-visual feature generation technique outperformed the baseline audio-only CNN ASR system by an average relative CER reduction of 6.8%.

The Effect of Real-Time Constraints on Automatic Speech Animation

This work considers asymmetric windows by investigating the extent to which decreasing the future context affects the quality of predicted animation, using both deep neural networks (DNNs) and bi-directional LSTM recurrent neural networks (BiLSTMs).

Exploiting Cross-Domain Visual Feature Generation for Disordered Speech Recognition

Experiments conducted on the UASpeech corpus suggest that the proposed cross-domain visual feature generation based AVSR system consistently outperformed the baseline ASR system and the AVSR system using original visual features.

Audio-driven facial animation by joint end-to-end learning of pose and emotion

This work presents a machine learning technique for driving 3D facial animation by audio input in real time and with low latency, and simultaneously discovers a compact, latent code that disambiguates the variations in facial expression that cannot be explained by the audio alone.

Emotion Dependent Domain Adaptation for Speech Driven Affective Facial Feature Synthesis

The proposed affective A2V system achieves significant MSE loss improvements in comparison to the recent literature and the resulting facial animations of the proposed system are preferred over the baseline animations in the subjective evaluations.

References

Audio-to-Visual Conversion Via HMM Inversion for Speech-Driven Facial Animation

Experimental results show that full covariance matrices are preferable, since performance similar to that of the diagonal-matrix case can be achieved with a less complex model.

An audio-visual corpus for speech perception and automatic speech recognition.

An audio-visual corpus that consists of high-quality audio and video recordings of 1000 sentences spoken by each of 34 talkers to support the use of common material in speech perception and automatic speech recognition studies.

Trainable videorealistic speech animation

  • T. Ezzat, G. Geiger, T. Poggio
  • Computer Science
    Sixth IEEE International Conference on Automatic Face and Gesture Recognition, 2004. Proceedings.
  • 2004
This work describes how to create with machine learning techniques a generative, videorealistic, and speech animation module that looks like a video camera recording of the subject.

Direct, modular and hybrid audio to visual speech conversion methods - a comparative study

A systematic comparative study of audio-to-visual speech conversion methods is described, and subjective opinion score evaluations show that direct conversion achieves the best naturalness.

Audio/visual mapping with cross-modal hidden Markov models

This paper quantitatively compares three recently proposed cross-modal HMM methods, namely the remapping HMM (R-HMM), the least-mean-squared HMM (LMS-HMM), and HMM inversion (HMMI), and shows that HMMI provides the best performance on both synthetic and experimental audio-visual data.

Dynamic units of visual speech

It is found that dynamic visemes are able to produce more accurate and visually pleasing speech animation given phonetically annotated audio, reducing the amount of time that an animator needs to spend manually refining the animation.

Real-time speech-driven face animation with expressions using neural networks

Experimental results show that the synthetic expressive talking face of the iFACE system is comparable with a real face in terms of the effectiveness of their influences on bimodal human emotion perception.

Mapping from Speech to Images Using Continuous State Space Models

A system that transforms speech waveforms into animated faces relies on continuous state space models to perform the mapping and is able to construct an image sequence from an unknown noisy speech sequence even though the number of training examples is limited.

"Eigenlips" for robust speech recognition

  • C. Bregler, Y. Konig
  • Physics, Computer Science
    Proceedings of ICASSP '94. IEEE International Conference on Acoustics, Speech and Signal Processing
  • 1994
This study improves the performance of a hybrid connectionist speech recognition system by incorporating visual information about the corresponding lip movements, using a new visual front end and an alternative architecture for combining the visual and acoustic information.

Audiovisual speech processing

  • Tsuhan Chen
  • Computer Science
    IEEE Signal Process. Mag.
  • 2001
Audiovisual speech processing results have shown that, with lip reading, it is possible to enhance the reliability of audio speech recognition, which may lead to a computer that can truly understand the user via hands-free natural spoken language even in very noisy environments.