Corpus ID: 3603195


Author: D. Roy
Our goal is to create situation-aware speech systems which integrate knowledge of relevant aspects of the current state of the world into spoken language learning, understanding and generation. Awareness of the situation is achieved by extracting salient information about the speaker's world from sensors including cameras, touch sensors, and microphones. A key challenge in this approach is to design and integrate linguistic and non-linguistic representations. This paper presents an implemented…
1 Citation


A Computational Model of Embodied Language Learning
Presents an implemented computational model of embodied language acquisition that learns words from natural interactions with users: it first spots words in continuous speech, then associates action verbs and object names with their grounded meanings.


References

Integration of speech and vision using mutual information
  • D. Roy
  • Computer Science
  • 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.00CH37100)
  • 2000
Presents a system which learns words from co-occurring spoken and visual input, automatically segments continuous speech at word boundaries without a lexicon, and forms visual categories that correspond to spoken words.
Word Spotting from Continuous Speech Utterances
This chapter is concerned with the problem of spotting keywords in continuous speech utterances, and with the Automatic Speech Recognition (ASR) problems that are particularly important in word-spotting applications.
Naive physics, event perception, lexical semantics, and language acquisition
Presents algorithms that employ a cross-situational learning strategy, whereby the learner finds a language model that is consistent across several utterances paired with their non-linguistic context, and advances three claims about event perception and the process of grounding language in visual perception.
Unsupervised language acquisition
The thesis introduces a variety of technical innovations, among them a common representation for evidence and grammars that has many linguistically and statistically desirable properties, and a learning strategy that separates the "content" of linguistic parameters from their representation.
Learning words from sights and sounds: a computational model
The model successfully performed speech segmentation, word discovery, and visual categorization from spontaneous infant-directed speech paired with video images of single objects, demonstrating that state-of-the-art techniques from sensory pattern recognition and machine learning can implement cognitive models that process raw sensor data without the need for human transcription or labeling.
Detecting acoustic morphemes in lattices for spoken language understanding
Current methods for training statistical language models for recognition and understanding require large annotated corpora. The collection, transcription and labeling of such corpora is a major…
Learning from multimodal observations
  • D. Roy
  • Computer Science
  • 2000 IEEE International Conference on Multimedia and Expo. ICME2000. Proceedings. Latest Advances in the Fast Changing World of Multimedia (Cat. No.00TH8532)
  • 2000
Presents a working system, inspired by infant language learning, that learns from untranscribed speech and images, exploring the idea of learning from unannotated data by leveraging information across multiple modes of input.
RASTA processing of speech
The theoretical and experimental foundations of the RASTA method are reviewed, its relationship with human auditory perception is discussed, the original method is extended to combinations of additive and convolutional noise, and an application to speech enhancement is shown.
On automated language acquisition
The purpose of this paper is to review our investigation into devices which automatically acquire spoken language. The principles and mechanisms underlying this research are described and then…
Transcription and Alignment of the TIMIT Database
Describes the transcription and alignment of the TIMIT database, performed at MIT; the database consists of 6,300 sentences from 630 speakers, representing over 5 hours of speech material, and was recorded by researchers at TI.