Deep Learning Based Multi-modal Addressee Recognition in Visual Scenes with Utterances

Thao Le Minh, N. Shimizu, Takashi Miyazaki, and Koichi Shinoda
With the widespread use of intelligent systems such as smart speakers, addressee recognition has become a concern in human-computer interaction, as more and more people expect such systems to understand complicated social scenes, including those outdoors, in cafeterias, and in hospitals. Because previous studies typically focused only on pre-specified tasks with limited conversational situations, such as controlling smart homes, we created a mock dataset called Addressee Recognition in Visual…
Cross-Corpus Data Augmentation for Acoustic Addressee Detection
This study distinguishes addressees in two settings, introduces the first competitive baseline (unweighted average recall = 0.891) for the Voice Assistant Conversation Corpus, which models the first setting, and jointly solves both classification problems using three models.
A Visually-grounded First-person Dialogue Dataset with Verbal and Non-verbal Responses
The results demonstrate that first-person vision helps neural network models correctly understand human intentions, and that producing non-verbal responses is as challenging a task as producing verbal responses.
A novel focus encoding scheme for addressee detection in multiparty interaction using machine learning algorithms
This research improves existing baseline accuracies for addressee prediction on two datasets and explores the impact of different focus encoding schemes in several addressee detection cases.
Multimodal Response Obligation Detection with Unsupervised Online Domain Adaptation
This paper proposes a novel multimodal response obligation detector that uses visual, audio, and text information for highly accurate detection, together with unsupervised online domain adaptation to solve the domain dependency problem.
Using Multimodal Information to Enhance Addressee Detection in Multiparty Interaction
A statistical approach based on smart feature selection that exploits contextual and multimodal information for addressee detection is proposed, and the results show that the model outperforms an existing baseline.
A Generic Machine Learning Based Approach for Addressee Detection In Multiparty Interaction
This article proposes a model based on generic features to predict the addressee in datasets with a varying number of participants and shows that the proposed model outperforms existing baselines.
Embodied Conversational AI Agents in a Multi-modal Multi-agent Competitive Dialogue
This work presents two embodied AI shopkeeper agents that sell similar items and compete with each other on price for a user's business, using head pose (estimated by deep learning techniques) to determine whom the user is talking to.
You Talkin' to Me? A Practical Attention-Aware Embodied Agent
The results of two studies are presented in which users engage with an assistant that infers whether it is being addressed from the user's head orientation, establishing with high confidence that head orientation combined with visual feedback is preferable to the traditional wake-up-word approach.
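The head-orientation inference described above can be sketched as a simple decision rule. This is an illustrative sketch only, not the paper's implementation; the 15° threshold and the three-frame persistence window are hypothetical values.

```python
# Sketch: decide whether the agent is being addressed from the user's
# head yaw relative to the device. The threshold and frame count are
# hypothetical, not taken from the paper.

def is_addressed(yaw_history, threshold_deg=15.0, min_frames=3):
    """Return True if the last `min_frames` head-yaw estimates
    (in degrees, 0 = facing the device) all fall within the threshold,
    i.e. the user has turned toward the device and held that pose."""
    if len(yaw_history) < min_frames:
        return False
    recent = yaw_history[-min_frames:]
    return all(abs(yaw) <= threshold_deg for yaw in recent)

# The user turns toward the device and holds their gaze:
print(is_addressed([40.0, 12.0, 8.0, 5.0]))   # True
# A brief glance that does not persist long enough:
print(is_addressed([40.0, 35.0, 8.0, 5.0]))   # False
```

Requiring the pose to persist over several frames is one plausible way to suppress false triggers from brief glances, which is the practical concern such attention-aware agents must address.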
HUMAINE: Human Multi-Agent Immersive Negotiation Competition
This work develops a platform that supports a new type of AI competition involving both agent-agent and human-agent interactions situated in an immersive environment, and presents the platform architecture and accompanying technologies.


Deep Learning for Acoustic Addressee Detection in Spoken Dialogue Systems
Deep learning methods such as fully-connected neural networks and Long Short-Term Memory networks were applied to the problem of addressee detection, with acoustic data chosen as the main modality because it is the most flexibly usable in modern spoken dialogue systems (SDSs).
A Study of Multimodal Addressee Detection in Human-Human-Computer Interaction
It is suggested that acoustic, lexical, and system-state information is an effective and practical combination of modalities to use for addressee detection in multiparty, open-world dialogue systems in which the agent plays an active, conversational role.
The vernissage corpus: A conversational Human-Robot-Interaction dataset
A new conversational Human-Robot Interaction (HRI) dataset with a real-behaving robot that induces interactive behavior with and between humans: a humanoid NAO robot explains paintings in a room and then quizzes the participants, who are naive users.
Speech and Text Analysis for Multimodal Addressee Detection in Human-Human-Computer Interaction
This study characterizes the connection between different levels of analysis and classification performance for different categories of speech, defines the dependence of addressee detection performance on speech recognition accuracy, and proposes a universal meta-model based on acoustic and syntactic analysis that may, in principle, be applied in different domains.
Combining dynamic head pose-gaze mapping with the robot conversational state for attention recognition in human-robot interactions
A dynamic Bayesian model for VFOA recognition from head pose is proposed, along with a novel gaze model that dynamically and more accurately predicts the expected head orientation used when looking in a given gaze-target direction.
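A filter of this general shape can be sketched as a discrete Bayesian update over candidate attention targets, with observed head yaw as a noisy cue. Everything below is an invented stand-in, not the paper's model: the target directions, the head-to-gaze attenuation factor, and the noise scale are illustrative values.

```python
import numpy as np

# Sketch: discrete Bayesian filter over visual focus of attention (VFOA)
# targets, where observed head yaw is a noisy cue for the gazed-at target.
# Target directions, kappa, and sigma are made-up illustrative values.

targets_deg = np.array([-30.0, 0.0, 30.0])  # e.g. robot, screen, other person
kappa = 0.6    # the head typically turns only part-way toward a gaze target
sigma = 10.0   # head-yaw observation noise (degrees)

# Sticky transition matrix: attention tends to stay on the same target.
T = np.full((3, 3), 0.1) + 0.7 * np.eye(3)

def filter_step(belief, observed_yaw):
    """One predict + update step of the VFOA filter."""
    predicted = T.T @ belief
    expected_yaw = kappa * targets_deg        # expected head pose per target
    lik = np.exp(-0.5 * ((observed_yaw - expected_yaw) / sigma) ** 2)
    posterior = predicted * lik
    return posterior / posterior.sum()

belief = np.ones(3) / 3
for yaw in [2.0, 15.0, 19.0, 21.0]:           # head drifting toward +30 deg
    belief = filter_step(belief, yaw)
print(belief.argmax())                        # → 2 (the +30 deg target)
```

The `kappa` attenuation reflects the key observation such models build on: head orientation under-shoots the true gaze direction, so the expected head pose for each target must be predicted rather than read off directly.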
DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition
DeCAF, an open-source implementation of deep convolutional activation features, is released along with all associated network parameters to enable vision researchers to experiment with deep representations across a range of visual concept learning paradigms.
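The core DeCAF idea — reusing activations from an upper layer of a pretrained network as generic features for a new task — can be sketched as follows. This is illustrative only: a fixed random two-layer network stands in for the pretrained CNN (in practice the weights come from ImageNet training), and the nearest-centroid classifier is just one simple downstream consumer of the features.

```python
import numpy as np

# Sketch of the DeCAF idea: penultimate-layer activations of a fixed,
# "pretrained" network serve as generic features. The random weights
# below are a stand-in for weights learned on a large dataset.

rng = np.random.default_rng(0)
W1 = rng.standard_normal((256, 64))   # stand-in "pretrained" layer 1
W2 = rng.standard_normal((64, 32))    # stand-in "pretrained" layer 2

def decaf_features(x):
    """Forward pass up to the penultimate layer; its ReLU activations
    are the reusable feature vector."""
    h1 = np.maximum(0, x @ W1)
    return np.maximum(0, h1 @ W2)

# Downstream use: a simple nearest-centroid classifier on the features.
def fit_centroids(X, y):
    feats = np.array([decaf_features(x) for x in X])
    return {c: feats[y == c].mean(axis=0) for c in np.unique(y)}

def predict(centroids, x):
    f = decaf_features(x)
    return min(centroids, key=lambda c: np.linalg.norm(f - centroids[c]))
```

The point of the design is that the feature extractor is frozen: only the cheap classifier on top is fit to the new task, which is what makes the representation "generic".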
Identifying the intended addressee in mixed human-human and human-computer interaction from non-verbal features
Eye gaze of both speaker and listener, dialogue history, utterance length, and the detailed timing of shifts in eye gaze between different communication partners are inspected, resulting in improved classification of utterances in terms of addressee-hood relative to a simple classification algorithm.
Show and tell: A neural image caption generator
This paper presents a generative model based on a deep recurrent architecture that combines recent advances in computer vision and machine translation and that can be used to generate natural sentences describing an image.
Addressee and Response Selection for Multi-Party Conversation
This work tackles addressee and response selection for multi-party conversation, in which systems are expected to select whom they address as well as what they say, and proposes two modeling frameworks.
Head Pose Patterns in Multiparty Human-Robot Team-Building Interactions
A data collection setup for exploring turn-taking in three-party human-robot interaction involving objects competing for attention is presented, and it is argued that this symmetry can be used to assess to what extent the system exhibits human-like behavior.