Prediction of head motion from speech waveforms with a canonical-correlation-constrained autoencoder

@article{Lu2020PredictionOH,
  title={Prediction of head motion from speech waveforms with a canonical-correlation-constrained autoencoder},
  author={JinHong Lu and Hiroshi Shimodaira},
  journal={ArXiv},
  year={2020},
  volume={abs/2002.01869}
}
This study investigates the direct use of speech waveforms to predict head motion for speech-driven head-motion synthesis, whereas the literature commonly uses spectral features such as MFCCs as basic input features together with additional features such as energy and F0. We claim that, rather than combining different features that all originate from the waveform, it is more effective to use the waveform directly to predict the corresponding head motion. The challenge with the waveform-based…
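The abstract's constraint can be made concrete with a small sketch. The code below is a hypothetical PyTorch rendering, assuming the canonical-correlation constraint is a differentiable CCA term, in the spirit of deep CCA, that pushes the waveform bottleneck to stay correlated with the target head-motion features; the class name CCCAE, all layer sizes, and the weight lam are illustrative assumptions, not the paper's actual configuration.

import torch
import torch.nn as nn

class CCCAE(nn.Module):
    def __init__(self, in_dim=400, bottleneck=30):
        super().__init__()
        # Encoder/decoder over one fixed-length input window.
        self.encoder = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                                     nn.Linear(256, bottleneck))
        self.decoder = nn.Sequential(nn.Linear(bottleneck, 256), nn.ReLU(),
                                     nn.Linear(256, in_dim))

def cca_correlation(H1, H2, eps=1e-4):
    """Sum of canonical correlations between two (batch, dim) matrices."""
    H1 = H1 - H1.mean(0)
    H2 = H2 - H2.mean(0)
    n = H1.shape[0] - 1
    S11 = H1.T @ H1 / n + eps * torch.eye(H1.shape[1])
    S22 = H2.T @ H2 / n + eps * torch.eye(H2.shape[1])
    S12 = H1.T @ H2 / n
    # Whiten with Cholesky factors; the singular values of the whitened
    # cross-covariance are the canonical correlations.
    L1 = torch.linalg.cholesky(S11)
    L2 = torch.linalg.cholesky(S22)
    T = torch.linalg.inv(L1) @ S12 @ torch.linalg.inv(L2).T
    return torch.linalg.svdvals(T).sum()

def cccae_loss(model, wave, motion, lam=0.5):
    z = model.encoder(wave)   # bottleneck embedding of the waveform window
    recon = model.decoder(z)
    # Reconstruction error minus the CCA term: the bottleneck must both
    # reconstruct the waveform and stay correlated with head motion.
    return nn.functional.mse_loss(recon, wave) - lam * cca_correlation(z, motion)

For example, cccae_loss(CCCAE(), torch.randn(64, 400), torch.randn(64, 6)) would drive a standard optimizer step; subtracting the summed canonical correlations is one common way to impose such a constraint without changing the autoencoder architecture.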

Citations

Towards Multimodal Human-Like Characteristics and Expressive Visual Prosody in Virtual Agents
TLDR
This paper outlines the author's PhD research plan, which aims to develop convincing, expressive, and natural behavior in embodied conversational agents (ECAs) and to explore and model the mechanisms that govern human-agent multimodal interaction.
Double-DCCCAE: Estimation of Body Gestures From Speech Waveform
TLDR
A system, double deep canonical-correlation-constrained autoencoder (D-DCCCAE), which encodes each of speech and motion segments into fixed-length embedded features that are well correlated with the segments of the other modality is proposed.
Double-DCCCAE: Estimation of Sequential Body Motion Using Wave-Form
TLDR
A frame-based system that estimates the motion in a sequential manner, the double deep canonical-correlation-constrained autoencoder (Double-DCCCAE), is proposed; it encodes sequential features (speech/motion) into frame-based embedded features with reconstruction-error and canonical correlation analysis (CCA) losses (see the sketch after this entry).
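Continuing the hypothetical sketch above (same assumptions; the CCCAE class and cca_correlation helper are reused, and all dimensions are illustrative), the "double" variant could pair one autoencoder per modality and tie their embeddings together with a shared CCA term:

speech_ae = CCCAE(in_dim=400)              # raw-waveform window autoencoder
motion_ae = CCCAE(in_dim=6, bottleneck=6)  # head-motion autoencoder

def double_dcccae_loss(wave, motion, lam=0.5):
    z_s = speech_ae.encoder(wave)
    z_m = motion_ae.encoder(motion)
    # Each side reconstructs its own modality; the shared CCA term ties
    # the two frame-based embeddings together across modalities.
    rec = (nn.functional.mse_loss(speech_ae.decoder(z_s), wave)
           + nn.functional.mse_loss(motion_ae.decoder(z_m), motion))
    return rec - lam * cca_correlation(z_s, z_m)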
Realistic talking face animation with speech-induced head motion
TLDR
This paper proposes a method for generating speech-driven, realistic talking-face animation with speech-coherent head motions, accurate lip sync, natural eye blinks, and high-fidelity texture; it uses an attention-based GAN network to identify the audio highly correlated with the speaker's head motion and to learn the relationship between the prosodic information of the speech and the corresponding head motions.
Multimodal Generation of Upper-Facial and Head Gestures with a Transformer Network Using Speech and Text
TLDR
This work proposes a model that generates gestures from multimodal input features, the first modality being text and the second speech prosody; it makes use of Transformers and convolutions to map the multimodal features corresponding to an utterance to continuous eyebrow and head gestures.
Transformer Network for Semantically-Aware and Speech-Driven Upper-Face Generation
TLDR
This work proposes a semantically aware, speech-driven model to generate expressive and natural upper-facial and head motion for embodied conversational agents (ECAs), and conducts subjective and objective evaluations to validate the approach and compare it with the state of the art.

References

SHOWING 1-10 OF 34 REFERENCES
Head motion synthesis from speech using deep neural networks
TLDR
A deep neural network (DNN) approach to head motion synthesis, which can automatically predict a speaker's head movement from his/her speech, is presented; the promising results in speech-to-head-motion prediction can be applied to talking-avatar animation.
A neural network based post-filter for speech-driven head motion synthesis
TLDR
This work proposes to employ a neural network trained to be capable of reconstructing head motions, in order to overcome the limitation of deep neural networks in predicting human motion.
Speech driven talking head from estimated articulatory features
In this paper, we present a talking head in which the lips and head motion are controlled using articulatory movements estimated from speech. A phone-sized HMM-based inversion mapping is employed and…
Acoustic Modeling of Speech Waveform Based on Multi-Resolution, Neural Network Signal Processing
TLDR
This paper extends the waveform-based NN model with a second level of time-convolutional elements, which generalizes the envelope-extraction block and allows the model to learn multi-resolution representations.
Natural head motion synthesis driven by acoustic prosodic features
TLDR
This paper presents a novel data-driven approach that synthesizes appropriate head motion by sampling from trained hidden Markov models (HMMs), and shows that the synthesized head motions follow the temporal dynamic behavior of real human subjects (a minimal sampling sketch follows below).
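To illustrate only the "sampling from trained HMMs" idea in this reference (the paper conditions on acoustic prosodic features, which this toy omits, and does not specify a library), a Gaussian HMM from hmmlearn can be fitted to motion frames and then sampled; the data here are random stand-ins.

import numpy as np
from hmmlearn.hmm import GaussianHMM

rng = np.random.default_rng(0)
train = rng.standard_normal((500, 3))  # stand-in for (pitch, yaw, roll) head-rotation frames
hmm = GaussianHMM(n_components=5, covariance_type="diag", random_state=0)
hmm.fit(train)                         # learn state dynamics from the motion frames
motion, states = hmm.sample(200)       # synthesize a 200-frame head-motion trajectory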
Bidirectional LSTM Networks Employing Stacked Bottleneck Features for Expressive Speech-Driven Head Motion Synthesis
TLDR
This work presents a novel approach that makes use of DNNs with stacked bottleneck features, combined with a BLSTM architecture, to model context and expressive variability; results from a subjective evaluation show a significant improvement of the bottleneck architecture over feed-forward DNNs.
Audio-visual synthesis of talking faces from speech production correlates
TLDR
It appears that realistic talking heads can be synthesized from the acoustics alone, and that better estimates of face motion from either speech acoustic parameters or muscle EMG activity can be obtained using a simple nonlinear (neural network) architecture.
Novel Realizations of Speech-Driven Head Movements with Generative Adversarial Networks
  • Najmeh Sadoughi, C. Busso
  • Computer Science
  • 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018
TLDR
A conditional GAN with bidirectional long short-term memory (BLSTM), which is suitable for capturing the long- and short-term dependencies of time-continuous signals, is proposed; the model is compared with a dynamic Bayesian network (DBN) and with BLSTM models optimized to reduce mean squared error (MSE) or to increase concordance correlation.
Acoustic Modelling from the Signal Domain Using CNNs
TLDR
The resulting 'direct-from-signal' network is competitive with state-of-the-art networks based on conventional features with iVector adaptation and, unlike some previous work on learned feature extractors, its objective function converges as fast as that of a network based on traditional features.