Word learning in a multimodal environment

@inproceedings{roy1998word,
  title={Word learning in a multimodal environment},
  author={D. Roy and A. Pentland},
  booktitle={Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP '98 (Cat. No.98CH36181)},
  pages={3761--3764 vol.6},
  year={1998}
}

  • D. Roy, A. Pentland
  • Published 1998
  • Computer Science
We are creating human-machine interfaces that let people communicate with machines using natural modalities, including speech and gesture. A problem with current multimodal interfaces is that users are forced to learn the set of words and gestures which the interface understands. We report on a trainable interface which lets the user teach the system words of their choice through natural multimodal interactions.


Learning words from natural audio-visual input
We present a model of early word learning which learns from natural audio and visual input. The model has been successfully implemented to learn words and their audio-visual grounding from camera and microphone input.
Integration of speech and vision using mutual information
  • D. Roy
  • Computer Science
  • 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.00CH37100)
  • 2000
A system that learns words from co-occurring spoken and visual input, automatically segmenting continuous speech at word boundaries without a lexicon and forming visual categories that correspond to spoken words.
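The pairing criterion named in this title, mutual information between spoken and visual events, can be sketched from co-occurrence counts. This is a toy illustration of the scoring idea, not the paper's implementation; the function name and all counts are invented for the example:

```python
import math

def mutual_information(n_xy, n_x, n_y, n_total):
    """Mutual information (in bits) between two binary events,
    'word heard' (x) and 'object visible' (y), estimated from counts.

    n_xy: co-occurrences; n_x, n_y: marginal counts; n_total: observations.
    """
    mi = 0.0
    # Counts for the four joint outcomes of (x, y).
    joint = {
        (1, 1): n_xy,
        (1, 0): n_x - n_xy,
        (0, 1): n_y - n_xy,
        (0, 0): n_total - n_x - n_y + n_xy,
    }
    px = {1: n_x / n_total, 0: 1 - n_x / n_total}
    py = {1: n_y / n_total, 0: 1 - n_y / n_total}
    for (x, y), n in joint.items():
        if n == 0:
            continue
        pxy = n / n_total
        mi += pxy * math.log2(pxy / (px[x] * py[y]))
    return mi

# A word that reliably co-occurs with its object scores high...
print(mutual_information(n_xy=45, n_x=50, n_y=50, n_total=1000))
# ...while a statistically independent pairing scores near zero.
print(mutual_information(n_xy=3, n_x=50, n_y=60, n_total=1000))
```

Ranking candidate word-object pairs by a score of this form favors pairings whose joint occurrence cannot be explained by chance.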
Two-way adaptation for robust input interpretation in practical multimodal conversation systems
This work presents a two-way adaptation framework that allows users and systems to dynamically adapt to each other's capabilities and needs during interaction, and improves the usability and robustness of a conversation system by helping users learn the system's capabilities in context.
Robot programming through multi-modal interaction
As robots enter the human environment and come in contact with inexperienced users, they need to be able to interact with users in a multi-modal fashion; keyboard and mouse are no longer acceptable as the only input devices.
Interactive and Incremental Learning via a Multisensory Mobile Robot
A computationally efficient scheme is developed that enables the robot to learn spoken language online and react properly to learned speech commands, without being restricted by the limitations that a speech-to-text mechanism may inherently have.
Learning words from sights and sounds: a computational model
The model successfully performed speech segmentation, word discovery, and visual categorization from spontaneous infant-directed speech paired with video images of single objects, demonstrating that state-of-the-art techniques from sensory pattern recognition and machine learning can implement cognitive models which process raw sensor data without the need for human transcription or labeling.
Teachable Interfaces for Individuals with Dysarthric Speech and Severe Physical Disabilities
Standard interfaces including keyboards, mice, and speech recognizers pose a major obstacle for individuals with severe speech and physical disabilities. A person with insufficient control of their speech or movements may be unable to use these interfaces at all.
Statistical Model Based Approach to Spoken Language Acquisition
This paper describes an algorithm for spoken language acquisition through a natural interface, based on the perception of speech and other information conveyed in continuous signal spaces. The algorithm is robust against ambiguity and sparseness of learning data because it uses statistical learning methods, such as Bayesian learning.
Perceptual Intelligence
Computer systems that can follow people's actions, recognizing their faces, gestures, and expressions, have begun to make possible "smart rooms" and "smart clothes" that help people in day-to-day life without chaining them to keyboards, pointing devices, or special goggles.


Multimodal Adaptive Interfaces
Depending on the task at hand and the user’s preferences, she will use a combination of speech and gesture in different ways to communicate her intent.
On automated language acquisition
The purpose of this paper is to review our investigation into devices which automatically acquire spoken language. The principles and mechanisms underlying this research are described.
The vocabulary problem in human-system communication
It is shown how this fundamental property of language limits the success of various design methodologies for vocabulary-driven interaction, and an optimal strategy, unlimited aliasing, is derived and shown to be capable of several-fold improvements.
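The "unlimited aliasing" strategy named here accepts every term users spontaneously produce for a referent by mapping many aliases onto one canonical command. A toy sketch of that strategy; the alias lists and function name are invented for illustration:

```python
# Unlimited aliasing: map every term users actually produce onto one
# canonical command, instead of forcing users to learn a single
# system-chosen vocabulary.
ALIASES = {
    "delete": ["delete", "remove", "erase", "kill", "trash", "discard"],
    "copy": ["copy", "duplicate", "clone", "replicate"],
}

# Invert the table for constant-time lookup of any user term.
LOOKUP = {term: command for command, terms in ALIASES.items() for term in terms}

def interpret(user_term):
    """Resolve a user's word to a canonical command, or None if unknown."""
    return LOOKUP.get(user_term.lower())

print(interpret("Erase"))   # resolves to the canonical "delete" command
print(interpret("xyzzy"))   # unknown terms resolve to None
```

The several-fold improvement in the paper comes from growing such alias tables from observed user vocabulary rather than from a designer's single guess at the "right" word.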
Word Spotting from Continuous Speech Utterances
This chapter is concerned with the problem of spotting keywords in continuous speech utterances and Automatic Speech Recognition (ASR) problems that are particularly important in word spotting applications.
Multimodal interfaces for dynamic interactive maps
Dynamic interactive maps with transparent but powerful human interface capabilities are beginning to emerge for a variety of geographical information systems, including ones situated on portable devices.
Smart rooms, desks and clothes
  • A. Pentland
  • Computer Science
  • 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing
  • 1997
This research is aimed at giving rooms, desks, and clothes the perceptual and cognitive intelligence needed to become active helpers.
An application of recurrent nets to phone probability estimation
  • A. J. Robinson
  • Computer Science, Medicine
  • IEEE Trans. Neural Networks
  • 1994
Recognition results are presented for the DARPA TIMIT and Resource Management tasks, and it is concluded that recurrent nets are competitive with traditional means for performing phone probability estimation.
RASTA processing of speech
The theoretical and experimental foundations of the RASTA method are reviewed, the relationship with human auditory perception is discussed, the original method is extended to combinations of additive noise and convolutional noise, and an application is shown to speech enhancement.
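RASTA's core idea is to band-pass filter each log-spectral channel over time, suppressing components that change too slowly (fixed channel effects) or too quickly (frame-to-frame jitter) to be speech. A rough sketch of that trajectory-filtering idea; the filter below is a generic band-pass built from two moving averages, not the published RASTA filter, and the window lengths are illustrative:

```python
def bandpass_trajectory(channel, slow_win=21, fast_win=3):
    """Band-pass one log-spectral channel over time by subtracting a
    long moving average (removes slow channel effects, e.g. a fixed
    microphone response) from a short moving average (removes very
    fast frame-to-frame jitter)."""
    def moving_average(x, win):
        half = win // 2
        out = []
        for t in range(len(x)):
            lo, hi = max(0, t - half), min(len(x), t + half + 1)
            out.append(sum(x[lo:hi]) / (hi - lo))
        return out
    slow = moving_average(channel, slow_win)
    fast = moving_average(channel, fast_win)
    return [f - s for f, s in zip(fast, slow)]

# A constant additive offset in the log spectrum (convolutional noise
# in the signal domain) is removed almost entirely: the filtered clean
# and offset trajectories coincide.
clean = [0.0, 1.0, 0.0, 1.0] * 10
noisy = [v + 5.0 for v in clean]  # channel offset of +5 in the log domain
print(bandpass_trajectory(noisy)[20], bandpass_trajectory(clean)[20])
```

Because convolutional distortion becomes an additive constant in the log-spectral domain, any filter with zero DC gain removes it, which is the robustness property the RASTA paper exploits.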
Communicative humanoids: a computational model of psychosocial dialogue skills
Thesis (Ph. D.)--Massachusetts Institute of Technology, Program in Media Arts & Sciences, 1996.
Multimodal adaptive interfaces
  • Technical Report 438, MIT Media Lab Vision and Modeling Group,
  • 1997