A spoken language generation system has been developed that learns to describe objects in computer-generated visual scenes. The system is trained by a 'show-and-tell' procedure in which visual scenes are paired with natural language descriptions. Learning algorithms acquire probabilistic structures which encode the visual semantics of phrase structure, ...
  • Deb Roy
  • 2005
We use words to communicate about things and kinds of things, their properties, relations and actions. Researchers are now creating robotic and simulated systems that ground language in machine perception and action, mirroring human abilities. A new kind of computational model is emerging from this work that bridges the symbolic realm of language with the ...
The meaning of words in everyday language depends on two very different kinds of relations. On one hand, words refer to (are about) the world. This relation rests on causal interactions between information and the physical world. On the other hand, agents use words to pursue goals by producing speech acts. A complete model of language must bridge these two ...
We report on an audio retrieval system which lets Internet users efficiently access a large audio database containing recordings of the proceedings of the United States House of Representatives. The audio has been temporally aligned to text transcripts of the proceedings (which are manually generated by the U.S. Government) using a novel method based on ...
The NewsComm system delivers personalized news and other program material as audio to mobile users through a hand-held playback device. This paper focuses on the iterative design and user testing of the hand-held interface. The interface was first designed and tested in a software-only environment and then ported to a custom hardware platform. The hand-held ...
This paper reports results from early experiments on automatic classification of spoken affect. The task was to classify short spoken sentences into one of two affect classes: approving or disapproving. Using an optimal combination of six acoustic measurements, our classifier achieved an accuracy of 65% to 88% for speaker-dependent, text-independent ...
Introduction: Our group is interested in creating human-machine interfaces which use natural modalities such as vision and speech to sense and interpret a user's actions (0). In this paper we describe recent work on multimodal adaptive interfaces which combine automatic speech recognition, computer vision for gesture tracking, and machine learning ...
As a step toward simulating dynamic dialogue between agents and humans in virtual environments, we describe learning a model of social behavior composed of interleaved utterances and physical actions. In our model, utterances are abstracted as {speech act, propositional content, referent} triples. After training a classifier on 100 gameplay logs from The ...