Ripley, Hand Me the Cup! (Sensorimotor Representations for Grounding Word Meaning)

Abstract

People leverage situational context when using language. Rather than conveying all information through words, speakers rely on listeners to infer their meanings from shared common ground [1, 2]. For machines to engage fully in conversation with humans, they must also link words to the world. We present a sensorimotor representation for physically grounding action verbs, modifiers, and spatial relations. We demonstrate an implementation of this framework in an interactive robot that uses the grounded lexicon to translate spoken commands into situationally appropriate actions.

1. SITUATED SPOKEN LANGUAGE

Speakers use spoken language to convey meaning to listeners by leveraging situational context. Context includes many levels of knowledge, ranging from fine-grained details of shared physical environments to shared cultural norms. As the degree of shared context between communication partners decreases, the efficiency of language also decreases, since the speaker is forced to explicate increasing quantities of information that could otherwise be left unsaid. A sufficient lack of common ground can lead to communication failures.

If machines are to engage in meaningful, fluent, situated spoken dialog, they must be aware of their situational context. As a starting point, we focus our attention on physical context. A machine that is aware of where it is, what it is doing, the presence and activities of other objects and people in its vicinity, and salient aspects of recent history can use these contextual factors to understand spoken language in a context-dependent manner.

A concrete example helps illustrate how a machine can make use of situational context. Consider a speech interface to the lights in a room¹. If a person simply says, “Lights!”, the appropriate action will depend on the current state of the light.
If it is already on, the command means turn off, but if it is already off, it means the opposite. In this simple example, the language understander needs access to a single bit of situational context, the current state of the light.

¹ Ignoring, for the moment, the difficult issue of microphone placement and background noise that would also need attention.

Consider a slightly richer problem, still in the domain of the light controller. How should the spoken command “softer” be interpreted by the light? Perhaps the simplest solution would be to decrease the intensity of the light by a fixed amount. Although this solution might be functional, it is not necessarily the most natural. In contrast to a fixed-interval solution, a person responding to this request would be likely to decrease the intensity by an amount that is a function of the intensity of light in the room at the time of the request. In general, many sources of light (e.g., from a setting sun) may contribute to the total ambient light in the room. For a machine to leverage this situational information, we could add a light sensor to the controller that is able to monitor ambient lighting conditions. A context-dependent interpretation of “softer” could then be defined.

1.1. Language Grounding

A necessary step towards creating situated speech processing systems is to develop representations and procedures that enable machines to ground the meaning of words in their physical environments. In contrast to dictionary definitions that represent words in terms of other words (leading, inevitably, to circular definitions for all words), grounded definitions anchor word meanings in non-linguistic primitives. Assuming that a machine has access to its environment through appropriate sensory channels, language grounding enables machines to link linguistic meanings to elements of the machine’s environment.
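
As an illustration only, the two light-controller examples can be written as a very small grounded lexicon, in which each word maps to a procedure over sensed context rather than to other words. The sketch below is not taken from the authors' system. The names `LightContext`, `SOFTER_FRACTION`, `ground_lights`, `ground_softer`, `LEXICON`, and `interpret` are hypothetical, and the rule that “softer” dims the lamp in proportion to the ambient reading is one assumed choice among many possible functions of room intensity.

```python
from dataclasses import dataclass, replace
from typing import Callable, Dict


@dataclass(frozen=True)
class LightContext:
    """Sensed situational context for the light controller (hypothetical fields)."""
    light_on: bool        # the single bit of context needed for "Lights!"
    lamp_level: float     # lamp output, 0.0 (dark) to 1.0 (full)
    ambient_level: float  # light-sensor reading of the room, 0.0 to 1.0


# Assumed constant: "softer" removes this fraction of the ambient reading.
SOFTER_FRACTION = 0.5


def ground_lights(ctx: LightContext) -> LightContext:
    """'Lights!' means the opposite of the current state."""
    return replace(ctx, light_on=not ctx.light_on)


def ground_softer(ctx: LightContext) -> LightContext:
    """'Softer' dims by an amount that depends on ambient light (assumed rule)."""
    step = SOFTER_FRACTION * ctx.ambient_level
    return replace(ctx, lamp_level=max(0.0, ctx.lamp_level - step))


# The grounded lexicon: words are anchored in procedures over sensor state.
LEXICON: Dict[str, Callable[[LightContext], LightContext]] = {
    "lights": ground_lights,
    "softer": ground_softer,
}


def interpret(word: str, ctx: LightContext) -> LightContext:
    """Look up a spoken word and apply its grounded meaning to the context."""
    key = word.strip().strip("!?.").lower()
    if key not in LEXICON:
        raise KeyError(f"no grounded definition for {word!r}")
    return LEXICON[key](ctx)


if __name__ == "__main__":
    bright_room = LightContext(light_on=True, lamp_level=1.0, ambient_level=0.5)
    dim_room = LightContext(light_on=True, lamp_level=1.0, ambient_level=0.125)
    print(interpret("Lights!", bright_room))
    print(interpret("softer", bright_room))
    print(interpret("softer", dim_room))
```

Under these assumptions, the same word produces different actions in different situations: “Lights!” turns the lamp off only because it is currently on, and “softer” takes a larger step in the brighter room than in the dimmer one.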
From environmentally aware light controllers to car navigation systems that see the same visual landmarks as the driver, the idea of context-grounded speech processing is the tip of a very large iceberg. We believe that a large class of spoken language understanding applications may benefit from language grounding. We will refer to this class of systems as having grounded semantics, in light of the explicit links of semantic representations to the machine’s physical


Cite this paper

@inproceedings{Roy2003RipleyHM, title={Ripley, Hand Me the Cup! (sensorimotor Representations for Grounding Word Meaning)}, author={Deb Roy and Kai-Yuh Hsiao and Nikolaos Mavridis and Peter Gorniak}, year={2003} }