A high speed transcription interface for annotating primary linguistic data
The Human Speechome Project is an effort to observe and computationally model the longitudinal course of language development of a single child at an unprecedented scale. The idea is this: Instrument a child’s home so that nearly everything the child hears and sees from birth to three is recorded. Develop a computational model of language learning that takes the child’s audio-visual experiential record as input. Evaluate the model’s performance in matching the child’s linguistic abilities as a means of assessing possible learning strategies used by children in natural contexts. First steps of a pilot effort along these lines are described, including issues of privacy management and methods for overcoming limitations of fully-automated machine perception.

Stepping into the Shoes of Children

To date, the primary means of studying language acquisition has been through observational recordings made in laboratory settings or made at periodic intervals in children’s homes. While laboratory studies provide many useful insights, it has often been argued that the ideal way to observe early child development is in the home, where the routines and context of everyday life are minimally disturbed. Bruner’s comment is representative:

I had decided that you could only study language acquisition at home, in vivo, not in the lab, in vitro. The issues of context sensitivity and the format of the mother-child interaction had already led me to desert the handsomely equipped but contrived video laboratory...in favor of the clutter of life at home. We went to the children rather than them come to us. [Bruner, 1983]

Unfortunately, the quality and quantity of available home observation data are surprisingly poor. Observations made in homes are sparse (typically 1-2 hours per week), and often introduce strong observer effects due to the physical presence of researchers in the home.
The fine-grained effects of experience on language acquisition are poorly understood in large part due to this lack of dense longitudinal data [Tomasello and Stahl, 2004]. The Human Speechome Project (HSP) attempts to address these shortcomings by creating the most comprehensive record of a single child’s development to date, coupled with novel data mining and modeling tools to make sense of the resulting massive corpus. The recent surge in availability of digital sensing and recording technologies enables ultra-dense observation: the capacity to record virtually everything a child sees and hears in his/her home, 24 hours per day for several years of continuous observation.

Figure 1: The goal of HSP is to create computational models of word learning evaluated on longitudinal in vivo audio-visual recordings.

We have designed an ultra-dense observational system based on a digital network of video cameras, microphones, and data capture hardware. The system has been carefully designed to respect infant and caregiver privacy and to avoid participant involvement in the recording process in order to minimize observer effects.

The recording system has been deployed and at the time of this writing (January 2006), the data capture phase is six months into operation. Two of the authors (DR, RP) and their first-born child (male, now six months of age, raised with English as the primary language) are the participants. Their home has been instrumented with video cameras and microphones. To date, we have collected 24,000 hours of video and 33,000 hours of audio recordings representing approximately 85% of the child’s waking experience. Over the course of the three-year study this corpus will grow six-fold.
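
As a rough check on the scale implied by these figures, the six-fold growth can be worked out directly. The sketch below simply multiplies the totals reported above by six; the names `project_corpus`, `VIDEO_HOURS_TO_DATE`, `AUDIO_HOURS_TO_DATE` and `GROWTH_FACTOR` are illustrative and not part of the project's software, and the result is a plain extrapolation rather than a reported measurement.

```python
# Totals reported in the text, six months into data capture.
VIDEO_HOURS_TO_DATE = 24_000
AUDIO_HOURS_TO_DATE = 33_000

# "Six-fold" growth over the three-year study, as stated in the text.
GROWTH_FACTOR = 6


def project_corpus(hours_to_date: int, growth_factor: int = GROWTH_FACTOR) -> int:
    """Scale a running total of recorded hours by the stated growth factor."""
    return hours_to_date * growth_factor


projected_video = project_corpus(VIDEO_HOURS_TO_DATE)
projected_audio = project_corpus(AUDIO_HOURS_TO_DATE)

print(f"Projected video: {projected_video:,} hours")
print(f"Projected audio: {projected_audio:,} hours")
```

Under this assumption of uniform six-fold growth, the completed corpus would hold on the order of 144,000 hours of video and 198,000 hours of audio.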