A new approach to the recognition of temporal behaviors and activities is presented. The fundamental idea, inspired by work in speech recognition, is to divide the inference problem into two levels. The lower level is performed using standard independent probabilistic temporal event detectors such as hidden Markov models (HMMs) to propose candidate detections of low level temporal features. The outputs of these detectors provide the input stream for a stochastic context-free grammar parsing mechanism. The grammar and parser provide longer range temporal constraints, disambiguate uncertain low level detections, and allow the inclusion of a priori knowledge about the structure of temporal events in a given domain. To achieve such a system we provide techniques for generating a discrete symbol stream from continuous low level detectors, for enforcing temporal exclusion constraints during parsing, and for generating a control method for low level feature application based upon the current parsing state. We demonstrate the approach in several experiments using both visual and other sensing data.

1 Introduction: stochastic action recognition

In the last several years there has been a tremendous growth in the amount of computer vision research aimed at understanding action. As noted by Bobick, these efforts have ranged from the interpretation of basic movements, such as recognizing someone walking or sitting, to the more abstract task of providing a Newtonian physics description of the motion of several objects. In particular, there has been emphasis on activities or behaviors where the entity to be recognized may be considered as a stochastically predictable sequence of states. The greatest number of examples comes from work in gesture recognition [14, 2, 13], where analogies to speech and handwriting recognition inspired researchers to devise hidden Markov model methods for the classification of gestures.
The basic premise of the approach is that the visual phenomena observed can be considered Markovian in some feature space, and that sufficient training data exists to automatically learn a suitable model to characterize the data.

Our research interests lie in the area of vision where observations span extended periods of time. We often find ourselves in situations where purely statistical approaches to recognition are less than ideal. These situations can be characterized by one or more of the following properties:

• complete data sets are not always available, but smaller examples could easily be found;
• semantically equivalent processes possess radically different statistical properties;
• competing hypotheses can absorb different lengths of the input stream, raising the need for naturally supported temporal segmentation;
• the structure of the process is difficult to learn but is explicit and known a priori.

Consider a simple example: we can draw a square with a hand in the air in either a clockwise or a counterclockwise direction. In either case our “square” model should indicate that a square is being drawn. This seemingly simple task requires significant effort when only statistical pattern recognition techniques are used. A human observer, on the other hand, can provide a set of useful heuristics for a system that would model the human’s higher level perception. As we recognize the need to characterize a signal by these heuristics, we turn our attention to syntactic pattern recognition and combined statistical-syntactic approaches, which allow us to address the problems listed above.

To take advantage of these techniques, we divide the activity recognition problem into two components. The lower level is performed using standard independent probabilistic temporal event detectors such as HMMs to propose candidate detections of low level temporal features.
The outputs of these detectors provide the input stream for a stochastic context-free grammar parsing mechanism. The grammar and parser enforce longer range temporal constraints, disambiguate or correct uncertain or mislabeled low level detections, and allow the inclusion of a priori knowledge about the structure of temporal events in a given domain.

For many domains such a division is clear. For example, consider ballroom dancing. There are a small number of primitives (e.g. right-leg-back) which are then structured into higher level units (e.g. box-step, quarter-turn, etc.). Typically one will have many examples of right-leg-back drawn from the relatively few examples of each of the higher level behaviors. Another example might be recognizing a car executing a parallel parking maneuver. The higher level activity can be described as follows: the car first executes a pull-alongside primitive, followed by an arbitrary number of cycles through the pattern turn-wheels-left, backup, turn-wheels-right, pull-forward. In these instances, there is a natural division between atomic, statistically abundant primitives and higher level coordinated behavior.
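
To make the two-level division concrete, the parallel parking description can be written as a small stochastic context-free grammar and scored against a stream of low level detections. The following is a minimal sketch under several assumptions and is not the parser developed in this paper: the nonterminal names, the rule probabilities (0.4 and 0.6), and the requirement of at least one cycle are all hypothetical; the grammar is hand-converted to Chomsky normal form; a plain CYK recursion that keeps the best-scoring parse is used for brevity; and each time slot is assumed to carry independent detector likelihoods that can simply be multiplied.

```python
# Hypothetical grammar for the parking example, in Chomsky normal form.
# All rule probabilities are invented for illustration only.
# Binary rules: head -> (left, right) with probability p.
BINARY = {
    "PARK":   [(("A", "CYCLES"), 1.0)],
    "CYCLES": [(("CYCLE", "CYCLES"), 0.4),   # one cycle, then more cycles
               (("L", "X1"), 0.6)],          # a final single cycle
    "CYCLE":  [(("L", "X1"), 1.0)],
    "X1":     [(("B", "X2"), 1.0)],
    "X2":     [(("R", "F"), 1.0)],
}

# Lexical rules: preterminal -> primitive name (probability 1 each).
LEXICAL = {
    "A": "pull-alongside",
    "L": "turn-wheels-left",
    "B": "backup",
    "R": "turn-wheels-right",
    "F": "pull-forward",
}


def best_parse_score(detections, start="PARK"):
    """Return the probability of the best parse of a detection stream.

    detections: one dict per time slot, mapping a primitive name to the
    likelihood reported by its low level detector (assumed in [0, 1]).
    Returns 0.0 if the grammar cannot derive any reading of the stream.
    """
    n = len(detections)
    best = {}  # (i, j, nonterminal) -> best probability for span [i, j)

    # Width-1 spans: each preterminal is scored by its detector likelihood.
    for i, slot in enumerate(detections):
        for nt, primitive in LEXICAL.items():
            likelihood = slot.get(primitive, 0.0)
            if likelihood > 0.0:
                best[(i, i + 1, nt)] = likelihood

    # Wider spans: combine two adjacent narrower spans through a binary rule.
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for head, rules in BINARY.items():
                for (left, right), p in rules:
                    for k in range(i + 1, j):
                        score = (p
                                 * best.get((i, k, left), 0.0)
                                 * best.get((k, j, right), 0.0))
                        if score > best.get((i, j, head), 0.0):
                            best[(i, j, head)] = score

    return best.get((0, n, start), 0.0)


def certain(*primitives):
    """Helper: build a stream in which every detection has likelihood 1."""
    return [{p: 1.0} for p in primitives]


CYCLE = ["turn-wheels-left", "backup", "turn-wheels-right", "pull-forward"]
one_cycle = certain("pull-alongside", *CYCLE)
two_cycles = certain("pull-alongside", *CYCLE, *CYCLE)

# A stream whose third slot is ambiguous between two primitives.
ambiguous = certain("pull-alongside", *CYCLE)
ambiguous[2] = {"backup": 0.5, "pull-forward": 0.5}
```

With these invented probabilities, `best_parse_score(one_cycle)` evaluates to 0.6 and `best_parse_score(two_cycles)` to about 0.24, while a stream that violates the ordering scores 0.0. For `ambiguous`, only the `backup` reading of the third slot is consistent with the grammar, so the score is about 0.3; this illustrates, in a very reduced form, how the higher level can disambiguate uncertain low level detections.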