Multimodal Continuous Turn-Taking Prediction Using Multiscale RNNs

@inproceedings{roddy2018multimodal,
  title={Multimodal Continuous Turn-Taking Prediction Using Multiscale RNNs},
  author={Matthew Roddy and Gabriel Skantze and Naomi Harte},
  booktitle={Proceedings of the 20th ACM International Conference on Multimodal Interaction},
}
In human conversational interactions, turn-taking exchanges can be coordinated using cues from multiple modalities. To design spoken dialog systems that can conduct fluid interactions, it is desirable to incorporate cues from separate modalities into turn-taking models. We propose that there is an appropriate temporal granularity at which modalities should be modeled. We design a multiscale RNN architecture to model modalities at separate timescales in a continuous manner. Our results show that…
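The abstract above is truncated and does not spell out the architecture, but the core idea (sub-networks updated at different timescales, with their states fused for a continuous prediction) can be sketched. The following is a hypothetical pure-Python illustration, not the authors' model: the cell sizes, update rates, and feature dimensions are all illustrative assumptions.

```python
import math
import random

random.seed(0)

def make_cell(n_in, n_h):
    # minimal Elman RNN cell: small random fixed weights, zero initial state
    rnd = lambda: random.uniform(-0.1, 0.1)
    return {
        "W": [[rnd() for _ in range(n_in)] for _ in range(n_h)],
        "U": [[rnd() for _ in range(n_h)] for _ in range(n_h)],
        "h": [0.0] * n_h,
    }

def step(cell, x):
    # h_t = tanh(W x_t + U h_{t-1})
    h_prev = cell["h"]
    cell["h"] = [
        math.tanh(
            sum(w * xi for w, xi in zip(W_row, x))
            + sum(u * hj for u, hj in zip(U_row, h_prev))
        )
        for W_row, U_row in zip(cell["W"], cell["U"])
    ]
    return cell["h"]

# illustrative setup: an acoustic sub-network ticks on every short frame,
# while a linguistic sub-network updates only at a slower, word-like rate
fast = make_cell(n_in=4, n_h=8)
slow = make_cell(n_in=3, n_h=8)

for t in range(20):
    step(fast, [random.random() for _ in range(4)])
    if t % 5 == 0:  # slower timescale: update once every 5 frames
        step(slow, [random.random() for _ in range(3)])
    # fused state that a downstream network could map to future speech activity
    fused = fast["h"] + slow["h"]
```

The point of the separation is that each modality is summarized at its own natural rate before fusion, rather than forcing all inputs onto a single frame clock.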


Turn-Taking Predictions across Languages and Genres Using an LSTM Recurrent Neural Network

An improved recurrent network model is presented that outperforms [1] and does so without requiring lexical annotation, providing good results in turn-taking prediction for English, Spanish, Japanese, Mandarin and French.

Smooth Turn-taking by a Robot Using an Online Continuous Model to Generate Turn-taking Cues

This work constructed the continuous model using the speaker’s prosodic features as inputs and evaluated its online performance. A subjective experiment asked participants to compare it against a baseline without turn-taking cues, which produces a response only once a speech recognition result is received.

Turn-Taking Prediction Based on Detection of Transition Relevance Place

This study proposes taking the concept of the transition relevance place (TRP) into account for turn-taking prediction, and annotates TRPs on a human-robot dialogue corpus while verifying the objectivity of the annotation across annotators.

Neural Generation of Dialogue Response Timings

It is shown that human listeners consider certain response timings to be more natural based on the dialogue context, and the introduction of these models into SDS pipelines could increase the perceived naturalness of interactions.

Voice Activity Projection: Self-supervised Learning of Turn-taking Events

The predictive task of Voice Activity Projection, a general self-supervised objective, is defined as a way to train turn-taking models without the need for labeled data, and the proposed model outperforms prior work.
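Because the objective is self-supervised, training targets can be derived from the audio's own voice-activity track. A hedged sketch of that target construction follows; the bin boundaries and the majority-vote rule are illustrative assumptions, not the paper's exact formulation.

```python
def projection_label(va, t, bins=((0, 10), (10, 30), (30, 60), (60, 120))):
    """Discretize one speaker's future voice activity at frame t.

    va   -- binary voice-activity sequence (1 = speaking)
    bins -- (start, end) frame offsets of each future window;
            these particular boundaries are illustrative
    Returns a tuple with 1 for each bin mostly filled with speech.
    """
    label = []
    for start, end in bins:
        window = va[t + start : t + end]
        label.append(1 if window and sum(window) >= len(window) / 2 else 0)
    return tuple(label)

# self-supervised: the label comes from the signal itself, no annotation
va = [0] * 10 + [1] * 120
label = projection_label(va, t=0)   # (0, 1, 1, 1): silence now, speech later
```

The set of possible bin tuples forms a small discrete vocabulary, so the projection can be trained as an ordinary classification problem.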

Investigating Linguistic and Semantic Features for Turn-Taking Prediction in Open-Domain Human-Computer Conversation

This paper focuses primarily on the predictive potential of linguistic features, including lexical, syntactic and semantic features, as well as timing features, whereas past work has typically placed more emphasis on prosodic features, sometimes supplemented with non-verbal behaviors such as gaze and head movements.

Response Timing Estimation for Spoken Dialog System using Dialog Act Estimation

We propose neural networks for predicting response timing of spoken dialog systems. Response timing varies depending on the dialog context. This context-dependent response timing is conventionally

The Duration of a Turn Cannot be Used to Predict When It Ends

Turn taking in conversation is a complex process. We still don’t know how listeners are able to anticipate the end of a speaker’s turn. Previous work focuses on prosodic, semantic, and non-verbal

Investigating Speech Features for Continuous Turn-Taking Prediction Using LSTMs

It is found that while traditional acoustic features perform well, part-of-speech features generally perform worse than word features, and the current models outperform previously reported baselines.

Towards a General, Continuous Model of Turn-taking in Spoken Dialogue using LSTM Recurrent Neural Networks

A predictive, continuous model of turn-taking using Long Short-Term Memory (LSTM) Recurrent Neural Networks (RNN) trained on human-human dialogue data to predict upcoming speech activity in a future time window is presented.
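In this continuous formulation, every frame receives a training target describing upcoming speech activity, rather than a single end-of-turn label per utterance. A minimal sketch of such target construction (the window length here is a hypothetical value; the real choice is a model design decision):

```python
def future_activity_targets(va, n_future=3):
    """For each frame t, the target is the voice activity over the
    next n_future frames. va is a binary voice-activity sequence;
    n_future is illustrative, not a value from the paper."""
    return [va[t + 1 : t + 1 + n_future] for t in range(len(va) - n_future)]

targets = future_activity_targets([0, 0, 1, 1, 1, 0, 0, 1], n_future=3)
# one target vector per frame: targets[0] covers frames 1..3
```

A recurrent model trained on these dense targets can then be queried at any moment of the dialogue, which is what makes the prediction "continuous" rather than tied to detected utterance boundaries.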

Multimodal end-of-turn prediction in multi-party meetings

The paper presents a multimodal approach to end-of-speaker-turn prediction using sequential probabilistic models (Conditional Random Fields) to learn a model from observations of real-life multi-party meetings.

A Finite-State Turn-Taking Model for Spoken Dialog Systems

Evaluation results on a deployed spoken dialog system show that the FSTTM provides significantly higher responsiveness than previous approaches to the problem of end-of-turn detection.
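The FSTTM of Raux and Eskenazi tracks who holds the floor with a small state machine and attaches costs to system actions in each state. The full model distinguishes six states; the stripped-down sketch below covers only the floor-tracking part, with simplified state names that are assumptions of this illustration.

```python
def next_state(state, user_speaking, system_speaking):
    # simplified floor-state tracker; the actual FSTTM uses more
    # states and cost-based decisions about when the system may act
    if user_speaking and system_speaking:
        return "BOTH"
    if user_speaking:
        return "USER"
    if system_speaking:
        return "SYSTEM"
    # silence: remember who spoke last, to tell a pause from a free floor
    if state in ("USER", "FREE_AFTER_USER"):
        return "FREE_AFTER_USER"
    return "FREE_AFTER_SYSTEM"

# a user turn followed by silence leaves the floor free after the user
state = "SYSTEM"
for user, system in [(1, 0), (1, 0), (0, 0), (0, 0)]:
    state = next_state(state, user, system)
```

End-of-turn detection then reduces to a decision rule on top of this state: the system takes the turn when the expected cost of speaking in the current silence state drops below the cost of waiting.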

Is the speaker done yet? faster and more accurate end-of-utterance detection using prosody

A new approach to EOU detection is presented that uses prosodic features and an event N-gram language model to obtain a score measuring the likelihood that any nonspeech region is an EOU.

Towards Deep End-of-Turn Prediction for Situated Spoken Dialogue Systems

This work was supported by the Cluster of Excellence Cognitive Interaction Technology ‘CITEC’ (EXC 277) at Bielefeld University, and the DFG-funded DUEL project (grant SCHL 845/5-1).

The Geneva Minimalistic Acoustic Parameter Set (GeMAPS) for Voice Research and Affective Computing

A basic standard acoustic parameter set for various areas of automatic voice analysis, such as paralinguistic or clinical speech analysis, is proposed and intended to provide a common baseline for evaluation of future research and eliminate differences caused by varying parameter sets or even different implementations of the same parameters.

The HCRC Map Task Corpus

A corpus of unscripted, task-oriented dialogues which has been designed, digitally recorded, and transcribed to support the study of spontaneous speech on many levels is described.