Joint Learning of Speech-Driven Facial Motion with Bidirectional Long-Short Term Memory

@inproceedings{Sadoughi2017JointLO,
  title={Joint Learning of Speech-Driven Facial Motion with Bidirectional Long-Short Term Memory},
  author={Najmeh Sadoughi and Carlos Busso},
  booktitle={IVA},
  year={2017}
}
The face conveys a blend of verbal and nonverbal information playing an important role in daily interaction. While speech articulation mostly affects the orofacial areas, emotional behaviors are externalized across the entire face. Considering the relation between verbal and nonverbal behaviors is important to create naturalistic facial movements for conversational agents (CAs). Furthermore, facial muscles connect areas across the face, creating principled relationships and dependencies between… 
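The title and abstract describe bidirectional long short-term memory (BLSTM) networks that map speech features to facial motion while modeling dependencies across facial regions. As a hedged illustration only, and not the authors' implementation, the sketch below shows one way such a speech-to-facial-motion regressor could be set up in PyTorch; the SpeechToFaceBLSTM name, feature dimensions, and layer sizes are assumptions.

# Minimal sketch (assumed names and dimensions), not the paper's implementation:
# a bidirectional LSTM that regresses facial-motion parameters from acoustic features.
import torch
import torch.nn as nn

class SpeechToFaceBLSTM(nn.Module):  # hypothetical name
    def __init__(self, n_audio=25, n_face=30, hidden=128, layers=2):
        super().__init__()
        # The bidirectional LSTM sees both past and future acoustic context.
        self.blstm = nn.LSTM(input_size=n_audio, hidden_size=hidden,
                             num_layers=layers, batch_first=True,
                             bidirectional=True)
        # A linear readout maps the concatenated forward/backward states
        # to facial-motion parameters (e.g., landmark or action-unit trajectories).
        self.out = nn.Linear(2 * hidden, n_face)

    def forward(self, audio_feats):        # (batch, frames, n_audio)
        h, _ = self.blstm(audio_feats)     # (batch, frames, 2*hidden)
        return self.out(h)                 # (batch, frames, n_face)

# Usage: train with a frame-wise regression loss such as MSE.
model = SpeechToFaceBLSTM()
audio = torch.randn(4, 200, 25)            # 4 utterances, 200 frames each
pred_motion = model(audio)
loss = nn.functional.mse_loss(pred_motion, torch.randn(4, 200, 30))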
Expressive Speech-Driven Lip Movements with Multitask Learning
  • Najmeh Sadoughi, C. Busso
  • Psychology
    2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018)
  • 2018
TLDR
Two deep learning speech-driven structures that integrate speech articulation and emotional cues are provided, based on multitask learning (MTL) strategies in which related secondary tasks are jointly solved when synthesizing orofacial movements.
Speech-Driven Expressive Talking Lips with Conditional Sequential Generative Adversarial Networks
TLDR
A conditional generative adversarial network, called conditional sequential GAN (CSG), which learns the relationship between emotion, lexical content, and lip movements in a principled manner; evaluations show significantly better results for this model when the target emotion is happiness.
Novel Realizations of Speech-Driven Head Movements with Generative Adversarial Networks
  • Najmeh Sadoughi, C. Busso
  • Computer Science
    2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2018
TLDR
A conditional GAN with bidirectional long short-term memory (BLSTM), which is suitable for capturing the long- and short-term dependencies of time-continuous signals; the model is compared with a dynamic Bayesian network (DBN) and with BLSTM models optimized to reduce mean squared error (MSE) or to increase concordance correlation (a hedged sketch of that objective appears after this entry).
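The concordance correlation mentioned above is a standard agreement measure between a predicted and a reference signal. The sketch below shows how a concordance correlation coefficient (CCC) objective could be computed, written as 1 - CCC so that minimizing the loss maximizes agreement; it follows the common definition and is not necessarily the exact formulation used in that paper.

# Hedged sketch of a concordance correlation coefficient (CCC) objective, following
# the standard definition: CCC = 2*cov(x,y) / (var(x) + var(y) + (mean_x - mean_y)^2).
import torch

def ccc_loss(pred, target, eps=1e-8):
    """Return 1 - CCC so that minimizing the loss maximizes agreement."""
    pred_mean, target_mean = pred.mean(), target.mean()
    pred_var, target_var = pred.var(unbiased=False), target.var(unbiased=False)
    covariance = ((pred - pred_mean) * (target - target_mean)).mean()
    ccc = 2.0 * covariance / (pred_var + target_var
                              + (pred_mean - target_mean) ** 2 + eps)
    return 1.0 - ccc

# Example: a trajectory that closely tracks the reference gives a loss near 0 (CCC near 1).
pred = torch.randn(200)
target = pred + 0.1 * torch.randn(200)
print(ccc_loss(pred, target).item())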
Predicting head pose from speech
TLDR
Algorithms for content-driven speech animation are developed: models that learn visual actions from data without semantic labelling in order to predict realistic speech animation from recorded audio.
Analyzing Input and Output Representations for Speech-Driven Gesture Generation
TLDR
A novel framework for automatic speech-driven gesture generation, applicable to human-agent interaction including both virtual agents and robots, using a denoising autoencoder neural network and a novel encoder network.
The Effect of Real-Time Constraints on Automatic Speech Animation
TLDR
This work considers asymmetric windows by investigating the extent to which decreasing the future context affects the quality of predicted animation, using both deep neural networks (DNNs) and bi-directional LSTM recurrent neural networks (BiLSTMs).
Speech-Driven Animation with Meaningful Behaviors (Speech Communication)
TLDR
The study proposes a DBN structure and a training approach that models the cause-effect relationship between the constraint and the gestures, and captures the differences in the behaviors across constraints by enforcing sparse transitions between shared and exclusive states per constraint.
Data-driven Gaze Animation using Recurrent Neural Networks
TLDR
This approach is the first to show the feasibility of synthesizing gaze motions with deep neural networks, and it achieves better perceived naturalness than the procedural gaze animation system of a well-known game company.
NADiA - Towards Neural Network Driven Virtual Human Conversation Agents
TLDR
The motivation and architecture of NADiA - Neurally Animated Dialog Agent - which leverages both the user's verbal input and facial expressions for multi-modal conversation, are described.

References

Showing 1-10 of 34 references
Generating Human-Like Behaviors Using Joint, Speech-Driven Models for Conversational Agents
TLDR
This paper focuses on building a speech-driven facial animation framework to generate natural head and eyebrow motions, and proposes three dynamic Bayesian networks (DBNs), which make different assumptions about the coupling between speech, eyebrow and head motion.
Interrelation Between Speech and Facial Gestures in Emotional Utterances: A Single Subject Study
TLDR
The results suggest that emotional content affects the relationship between facial gestures and speech, and principal component analysis (PCA) shows that the audiovisual mapping parameters are grouped in a smaller subspace, which suggests an emotion-dependent structure that is preserved across sentences.
Modeling Multimodal Behaviors from Speech Prosody
TLDR
A fully parameterized Hidden Markov Model is proposed, first to capture the tight relationship between speech and the facial movement of a human face extracted from a video corpus, and then to automatically drive a virtual agent's behaviors from speech signals.
Rigid Head Motion in Expressive Speech Animation: Analysis and Synthesis
TLDR
The results suggest that appropriate head motion not only significantly improves the naturalness of the animation but can also be used to enhance its emotional content and effectively engage the users.
Interplay between linguistic and affective goals in facial expression during emotional utterances
Communicative goals are simultaneously expressed through gestures and speech to convey messages enriched with valuable verbal and non-verbal clues. This paper analyzes and quantifies how linguistic…
Expressive Speech Driven Talking Avatar Synthesis with DBLSTM Using Limited Amount of Emotional Bimodal Data
TLDR
Experimental results indicate that the method of concatenating neutral facial features with emotional acoustic features as the input of the DBLSTM model achieves the best performance in both objective and subjective evaluations.
Expressive speech-driven facial animation
TLDR
This work derives a generative model of expressive facial motion that incorporates emotion control, while maintaining accurate lip-synching, from a database of speech-related high-fidelity facial motions.
Feature and model level compensation of lexical content for facial emotion recognition
  • Soroosh Mariooryad, C. Busso
  • Psychology
    2013 10th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG)
  • 2013
TLDR
The emotion recognition experiments on the IEMOCAP corpus validate the effectiveness of the proposed feature and model level compensation approaches both at the viseme and utterance levels.
HMM-based synthesis of emotional facial expressions during speech in synthetic talking heads
TLDR
Evaluation of the experimental results shows that HMMs for emotional facial expression synthesis have some limitations but are suitable for making a synthetic talking head more expressive and realistic.
A deep bidirectional LSTM approach for video-realistic talking head
TLDR
Experimental results show that the proposed DBLSTM approach outperforms the existing HMM-based approach in both objective and subjective evaluations.