A Novel Speech-Driven Lip-Sync Model with CNN and LSTM

  • Xiaohong Li, Xiang Wang, Kai Wang, Shiguo Lian
  • Published 23 October 2021
  • Computer Science
  • 2021 14th International Congress on Image and Signal Processing, BioMedical Engineering and Informatics (CISP-BMEI)
Generating synchronized and natural lip movement with speech is one of the most important tasks in creating realistic virtual characters. In this paper, we present a combined deep neural network of one-dimensional convolutions and LSTM to generate vertex displacement of a 3D template face model from variable-length speech input. The motion of the lower part of the face, which is represented by the vertex movement of 3D lip shapes, is consistent with the input speech. In order to enhance the… 
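The pipeline the abstract describes (one-dimensional convolutions over time, then an LSTM, then per-frame vertex displacements added to a template mesh) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the feature size, layer widths, ReLU activation, gate ordering, random weights, and all function names (`conv1d_same`, `lstm`, `predict_vertices`, and so on) are assumptions chosen only to show how variable-length input maps to one mesh per audio frame.

```python
import math
import random

def conv1d_same(frames, kernels, bias):
    """Zero-padded 1D convolution over time followed by ReLU.
    frames: T x C_in, kernels: C_out x K x C_in. Returns T x C_out."""
    T, K = len(frames), len(kernels[0])
    pad = K // 2
    out = []
    for t in range(T):
        row = []
        for kernel, b in zip(kernels, bias):
            acc = b
            for k in range(K):
                s = t + k - pad
                if 0 <= s < T:  # positions outside the clip count as zeros
                    acc += sum(w * x for w, x in zip(kernel[k], frames[s]))
            row.append(max(0.0, acc))
        out.append(row)
    return out

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm(frames, W, b, hidden):
    """Single-layer LSTM; W is (4*hidden) x (input+hidden).
    Gate order i, f, g, o is an assumption of this sketch."""
    h, c = [0.0] * hidden, [0.0] * hidden
    outputs = []
    for x in frames:
        z = x + h  # concatenate current input with previous hidden state
        pre = [sum(w * v for w, v in zip(row, z)) + bj for row, bj in zip(W, b)]
        i = [sigmoid(p) for p in pre[:hidden]]
        f = [sigmoid(p) for p in pre[hidden:2 * hidden]]
        g = [math.tanh(p) for p in pre[2 * hidden:3 * hidden]]
        o = [sigmoid(p) for p in pre[3 * hidden:]]
        c = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
        h = [oj * math.tanh(cj) for oj, cj in zip(o, c)]
        outputs.append(h)
    return outputs

def linear(frames, W, b):
    return [[sum(w * v for w, v in zip(row, x)) + bj for row, bj in zip(W, b)]
            for x in frames]

def rand_matrix(rng, rows, cols, scale=0.1):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

def make_params(n_feat, n_conv, kernel, hidden, n_vertices, seed=0):
    """Untrained random weights, for shape illustration only."""
    rng = random.Random(seed)
    return {
        "conv_w": [rand_matrix(rng, kernel, n_feat) for _ in range(n_conv)],
        "conv_b": [0.0] * n_conv,
        "lstm_w": rand_matrix(rng, 4 * hidden, n_conv + hidden),
        "lstm_b": [0.0] * (4 * hidden),
        "out_w": rand_matrix(rng, 3 * n_vertices, hidden),
        "out_b": [0.0] * (3 * n_vertices),
    }

def predict_vertices(audio_frames, template, p):
    """audio_frames: T x n_feat; template: flat list of 3*V coordinates.
    Returns T frames, each the template plus a predicted displacement."""
    hidden = len(p["lstm_b"]) // 4
    x = conv1d_same(audio_frames, p["conv_w"], p["conv_b"])
    x = lstm(x, p["lstm_w"], p["lstm_b"], hidden)
    disp = linear(x, p["out_w"], p["out_b"])
    return [[tv + d for tv, d in zip(template, frame)] for frame in disp]
```

Because the convolution is zero-padded and the LSTM emits one hidden state per step, a clip of T audio frames always yields T displaced meshes, which is one plausible way to accommodate the variable-length speech input the abstract mentions.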


Articulation GAN: Unsupervised modeling of articulatory learning

A new unsupervised generative model of speech production/synthesis is proposed that includes articulatory representations and thus more closely mimics human speech production; implications of articulatory representation for cognitive models of human language and for speech technology in general are discussed.

Constructing marine expert management knowledge graph based on Trellisnet-CRF

A novel marine science domain-based knowledge graph framework that utilizes various entity information based on marine domain experts to enrich the semantic content of the knowledge graph is presented and a novel TrellisNet-CRF model is proposed.



VisemeNet: Audio-Driven Animator-Centric Speech Animation

A novel deep-learning based approach to producing animator-centric speech motion curves that drive a JALI or standard FACS-based production face-rig, directly from input audio, that integrates seamlessly into existing animation pipelines.

JALI: an animator-centric viseme model for expressive lip synchronization

A system is presented that, given an input audio soundtrack and speech transcript, automatically generates expressive lip-synchronized facial animation that is amenable to further artistic refinement and comparable with both performance capture and professional animator output.

Audio- and Gaze-driven Facial Animation of Codec Avatars

This paper describes the first approach to animate Codec Avatars in real-time which could be deployed on commodity virtual reality hardware using audio and/or eye tracking and investigates a multimodal fusion approach that dynamically identifies which sensor encoding should animate which parts of the face at any time.

Talking heads synthesis from audio with deep neural networks

The proposed method uses lower-level audio features than phonemes, which enables it to synthesize talking heads with expressions, whereas existing methods that use phonemes as audio features can only synthesize talking heads with neutral expressions.

Deep Speech: Scaling up end-to-end speech recognition

Deep Speech, a state-of-the-art speech recognition system developed using end-to-end deep learning, outperforms previously published results on the widely studied Switchboard Hub5'00, achieving 16.0% error on the full test set.

OpenFace 2.0: Facial Behavior Analysis Toolkit

OpenFace 2.0 is an extension of OpenFace toolkit and is capable of more accurate facial landmark detection, head pose estimation, facial action unit recognition, and eye-gaze estimation.

Learning a model of facial shape and expression from 4D scans

FLAME (Faces Learned with an Articulated Model and Expressions) is low-dimensional but more expressive than the FaceWarehouse model and the Basel Face Model; it is compared to these models by fitting them to static 3D scans and 4D sequences using the same optimization method.

THCHS-30 : A Free Chinese Speech Corpus

This paper releases a free Chinese speech database THCHS-30 that can be used to build a full-fledged Chinese speech recognition system, and reports the baseline system established with this database, including the performance under highly noisy conditions.

TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems

The TensorFlow interface and an implementation of that interface built at Google are described; the system has been used for conducting research and for deploying machine learning systems into production across more than a dozen areas of computer science and other fields.