A Framework for Integrating Gesture Generation Models into Interactive Conversational Agents

by Rajmund Nagy, Taras Kucherenko, Birger Moell, André Pereira, Hedvig Kjellström and Ulysses Bernardet
Embodied conversational agents (ECAs) benefit from non-verbal behavior for natural and efficient interaction with users. Gesticulation – hand and arm movements accompanying speech – is an essential part of non-verbal behavior. Gesture generation models have been developed for several decades, evolving from rule-based systems to predominantly data-driven methods. To date, recent end-to-end gesture generation methods have not been evaluated in real-time interaction with users. We present a proof…

Evaluating data-driven co-speech gestures of embodied conversational agents through real-time interaction

Embodied Conversational Agents (ECAs) that make use of co-speech gestures can enhance human-machine interactions in many ways. In recent years, data-driven gesture generation approaches for ECAs have…

Expressing Personality Through Non-verbal Behaviour in Real-Time Interaction

The attribution of traits plays an important role as a heuristic for how we interact with others. Many psychological models of personality are analytical in that they derive a classification from…

Gesticulator: A framework for semantically-aware speech-driven gesture generation

This work presents a model designed to produce arbitrary beat and semantic gestures together, which takes both acoustic and semantic representations of speech as input, and generates gestures as a sequence of joint angle rotations as output.

Synthesizing multimodal utterances for conversational agents

An incremental production model is presented that combines the synthesis of synchronized gestural, verbal, and facial behaviors with mechanisms for linking them in fluent utterances with natural co‐articulation and transition effects.

Investigating the use of recurrent motion modelling for speech gesture generation

This work explores the use of transfer learning using previous motion modelling research to improve learning outcomes for gesture generation from speech, using a recurrent network with an encoder-decoder structure that takes in prosodic speech features and generates a short sequence of gesture motion.

Towards a Common Framework for Multimodal Generation: The Behavior Markup Language

An international effort to unify a multimodal behavior generation framework for Embodied Conversational Agents (ECAs) is described, proposing a pipeline whose stages represent intent planning, behavior planning, and behavior realization.

Robots Learn Social Skills: End-to-End Learning of Co-Speech Gesture Generation for Humanoid Robots

The proposed end-to-end neural network model consists of an encoder for speech-text understanding and a decoder that generates a sequence of gestures; it successfully produces a variety of gesture types, including iconic, metaphoric, deictic, and beat gestures.

Generating coherent spontaneous speech and gesture from text

This paper demonstrates a proof-of-concept system trained on a single-speaker audio and motion-capture dataset, that is able to generate both speech and full-body gestures together from text input, in contrast to previous approaches for joint speech-and-gesture generation.

A friendly gesture: Investigating the effect of multimodal robot behavior in human-robot interaction

This research investigates how humans perceive various gestural patterns performed by the robot as they interact in a situational context and suggests that the robot is evaluated more positively when non-verbal behaviors such as hand and arm gestures are displayed along with speech.

A Large, Crowdsourced Evaluation of Gesture Generation Systems on Common Data: The GENEA Challenge 2020

The GENEA Challenge was launched, a gesture-generation challenge wherein participating teams built automatic gesture-generation systems on a common dataset, and the resulting systems were evaluated in parallel in a large, crowdsourced user study using the same motion-rendering pipeline.

Style‐Controllable Speech‐Driven Gesture Synthesis Using Normalising Flows

This paper proposes a new generative model for generating state‐of‐the‐art realistic speech‐driven gesticulation, called MoGlow, and demonstrates the ability to exert directorial control over the output style, such as gesture level, speed, symmetry and spatial extent.

Head Motion Generation with Synthetic Speech: A Data Driven Approach

This paper proposes strategies to leverage speech-driven models for head motion generation in cases relying on synthetic speech, and introduces a parallel corpus of synthetic speech aligned with natural recordings for which the authors have motion-capture data.