Multimodal Continuous Turn-Taking Prediction Using Multiscale RNNs

Matthew Roddy, Gabriel Skantze, Naomi Harte
Proceedings of the 20th ACM International Conference on Multimodal Interaction

In human conversational interactions, turn-taking exchanges can be coordinated using cues from multiple modalities. To design spoken dialog systems that can conduct fluid interactions it is desirable to incorporate cues from separate modalities into turn-taking models. We propose that there is an appropriate temporal granularity at which modalities should be modeled. We design a multiscale RNN architecture to model modalities at separate timescales in a continuous manner. Our results show that… 
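
The idea of modeling modalities at separate timescales can be illustrated with a toy sketch. This is not the authors' architecture: the single-unit tanh cell stands in for an LSTM, and the names `rnn_step`, `multiscale_forward`, and `ratio`, the fixed weights, and the rule that a slow-stream feature becomes available once every `ratio` fast frames are all assumptions made for illustration. The sketch only shows one plausible way a slowly updated state could be held and paired with a fast state at every fast frame, so that a prediction could be made continuously.

```python
import math


def rnn_step(state, x, w_rec=0.5, w_in=1.0):
    """One step of a toy single-unit tanh RNN (a stand-in for an LSTM cell)."""
    return math.tanh(w_rec * state + w_in * x)


def multiscale_forward(fast_inputs, slow_inputs, ratio):
    """Run a fast and a slow toy RNN and pair their states at the fast rate.

    fast_inputs: one scalar feature per fast frame (hypothetical, e.g. acoustic).
    slow_inputs: one scalar feature per `ratio` fast frames (hypothetical).
    Assumption: slow_inputs[i] becomes available at fast frame i * ratio.
    Returns one (fast_state, held_slow_state) pair per fast frame.
    """
    if len(fast_inputs) != ratio * len(slow_inputs):
        raise ValueError("fast stream must be exactly `ratio` times longer")

    fast_state, slow_state = 0.0, 0.0
    fused = []
    for t, x_fast in enumerate(fast_inputs):
        # The slow network updates only when a new slow feature arrives;
        # between updates its most recent state is held constant.
        if t % ratio == 0:
            slow_state = rnn_step(slow_state, slow_inputs[t // ratio])
        # The fast network updates at every frame.
        fast_state = rnn_step(fast_state, x_fast)
        fused.append((fast_state, slow_state))
    return fused


# Six fast frames paired with two slow features (ratio of 3).
fused = multiscale_forward([0.1] * 6, [1.0, -1.0], ratio=3)
```

In a trained model the paired states would feed a learned output layer; here they are returned as-is so the holding behavior of the slow stream is easy to inspect.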

Turn-Taking Predictions across Languages and Genres Using an LSTM Recurrent Neural Network

An improved recurrent network model is presented that outperforms [1] and does so without requiring lexical annotation, providing good results in turn-taking prediction for English, Spanish, Japanese, Mandarin and French.

Smooth Turn-taking by a Robot Using an Online Continuous Model to Generate Turn-taking Cues

This work constructs a continuous model that takes the speaker’s prosodic features as inputs, evaluates its online performance, and reports a subjective experiment in which participants compared it with a baseline without turn-taking cues, which produces a response only when a speech recognition result is received.

Turn-Taking Prediction Based on Detection of Transition Relevance Place

This study proposes taking into account the concept of the transition relevance place (TRP) for turn-taking prediction, and conducts annotation of TRP on a human-robot dialogue corpus, ensuring the objectivity of this annotation among annotators.

Neural Generation of Dialogue Response Timings

It is shown that human listeners consider certain response timings to be more natural based on the dialogue context, and the introduction of these models into SDS pipelines could increase the perceived naturalness of interactions.

Voice Activity Projection: Self-supervised Learning of Turn-taking Events

This work extends prior work on the predictive task of Voice Activity Projection, a general, self-supervised objective, as a way to train turn-taking models without the need for labeled data, and the proposed model outperforms prior work.

Investigating Linguistic and Semantic Features for Turn-Taking Prediction in Open-Domain Human-Computer Conversation

This paper focuses primarily on the predictive potential of linguistic features, including lexical, syntactic and semantic features, as well as timing features, whereas past work has typically placed more emphasis on prosodic features, sometimes supplemented with non-verbal behaviors such as gaze and head movements.

The Duration of a Turn Cannot be Used to Predict When It Ends

Turn taking in conversation is a complex process. We still do not know how listeners are able to anticipate the end of a speaker’s turn. Previous work focuses on prosodic, semantic, and non-verbal

Timing Generating Networks: Neural Network Based Precise Turn-Taking Timing Prediction in Multiparty Conversation

A new neural-network-based framework for precise timing generation, named the Timing Generating Network (TGN), is proposed and applied to turn-taking timing decision problems. The experimental results show that the proposed system is superior to a conventional turn-taking system that makes hard decisions based on the user’s voice activity detection and response time estimation.



Investigating Speech Features for Continuous Turn-Taking Prediction Using LSTMs

It is found that while traditional acoustic features perform well, part-of-speech features generally perform worse than word features, and the current models outperform previously reported baselines.

Towards a General, Continuous Model of Turn-taking in Spoken Dialogue using LSTM Recurrent Neural Networks

A predictive, continuous model of turn-taking using Long Short-Term Memory (LSTM) Recurrent Neural Networks (RNN) trained on human-human dialogue data to predict upcoming speech activity in a future time window is presented.

Multimodal end-of-turn prediction in multi-party meetings

The paper presents a multimodal approach to end-of-speaker-turn prediction using sequential probabilistic models (Conditional Random Fields) to learn a model from observations of real-life multi-party meetings.

A Finite-State Turn-Taking Model for Spoken Dialog Systems

Evaluation results on a deployed spoken dialog system show that the FSTTM provides significantly higher responsiveness than previous approaches to the problem of end-of-turn detection.

Is the speaker done yet? Faster and more accurate end-of-utterance detection using prosody

This work develops a new approach to EOU detection that uses prosodic features, modeled by decision trees and combined with an event N-gram language model to obtain a score that measures the likelihood that any nonspeech region is an EOU.

Towards Deep End-of-Turn Prediction for Situated Spoken Dialogue Systems

This work was supported by the Cluster of Excellence Cognitive Interaction Technology ‘CITEC’ (EXC 277) at Bielefeld University, and the DFG-funded DUEL project (grant SCHL 845/5-1).

The HCRC Map Task Corpus

A corpus of unscripted, task-oriented dialogues which has been designed, digitally recorded, and transcribed to support the study of spontaneous speech on many levels is described.

Pauses, gaps and overlaps in conversations