Corpus ID: 11196368

A Computational Model of Word Learning from Multimodal Sensory Input

@inproceedings{Roy2000ACM,
  title={A Computational Model of Word Learning from Multimodal Sensory Input},
  author={D. Roy},
  year={2000}
}
How do infants segment continuous streams of speech to discover words of their language? Current theories emphasize the role of acoustic evidence in discovering word boundaries (Cutler 1991; Brent 1999; de Marcken 1996; Friederici & Wessels 1993; see also Bolinger & Gerstman 1957). To test an alternative hypothesis, we recorded natural infant-directed speech from caregivers engaged in play with their pre-linguistic infants centered around common objects. We also recorded the visual context in…

Citations

A computational model for unsupervised childlike speech acquisition
A model for early infant speech-structure acquisition, implemented as a layered architecture comprising phones, syllables, and words, together with an integrated model of speech-structure and imitation learning through interaction that enables the authors' robot to learn to speak with its own voice.
A DNN-HMM-DNN Hybrid Model for Discovering Word-Like Units from Spoken Captions and Image Regions
Demonstrates a model inspired by statistical machine translation and hidden Markov model/deep neural network (HMM/DNN) hybrid systems that outperforms an audio-only segmental embedded-GMM approach on standard word-discovery evaluation metrics.
Multimodal Semantic Learning from Child-Directed Input
Presents a distributed word-learning model that operates on child-directed speech paired with realistic visual scenes; the model integrates linguistic and extra-linguistic information, handles referential uncertainty, and correctly learns to associate words with objects even with limited linguistic exposure.
Align or attend? Toward More Efficient and Accurate Spoken Word Discovery Using Speech-to-Image Retrieval
A theoretical analysis shows that some kind of alignment/attention mechanism is crucial for a multimodal word discovery (MWD) system to learn meaningful word-level representations, and it is empirically demonstrated that both neural MT with self-attention and statistical MT achieve word-discovery scores superior to those of a state-of-the-art neural retrieval system.
Interactive learning of words and objects for a humanoid robot
This thesis presents the algorithmic solutions required for efficient learning of word-referent associations from data acquired in a simplified but realistic acquisition setup, which made it possible to run extensive simulations and preliminary experiments in real human-robot interactions.
Comparison Studies on Active Cross-Situational Object-Word Learning Using Non-Negative Matrix Factorization and Latent Dirichlet Allocation
Proposes cross-situational learning methods, based on non-negative matrix factorization and latent Dirichlet allocation, that tackle referential and linguistic ambiguities and can be combined with two active learning strategies: maximum-reconstruction-error-based selection and confidence-based exploration.
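The entry above names non-negative matrix factorization (NMF) as one way to learn word-object associations across ambiguous situations. As a rough, hypothetical sketch of that idea only (the toy data, matrix layout, and component-reading heuristic below are assumptions, not the cited paper's pipeline), word-occurrence and object-presence features can be stacked per situation and NMF left to recover components that link them:

import numpy as np
from sklearn.decomposition import NMF

# Toy cross-situational data (assumed for illustration): each column is one
# learning situation; the top block records which word was uttered and the
# bottom block which objects were visible (several may be visible at once,
# so the mapping is referentially ambiguous).
words = ["ball", "cup", "dog"]
objects = ["BALL", "CUP", "DOG"]

word_block = np.array([
    [1, 1, 0, 0, 1, 0, 0, 1],   # "ball" uttered
    [0, 0, 1, 1, 0, 0, 1, 0],   # "cup" uttered
    [0, 0, 0, 0, 0, 1, 0, 0],   # "dog" uttered
], dtype=float)

object_block = np.array([
    [1, 1, 1, 0, 1, 0, 0, 1],   # BALL visible
    [0, 1, 1, 1, 0, 0, 1, 0],   # CUP visible
    [0, 0, 0, 0, 1, 1, 0, 0],   # DOG visible
], dtype=float)

X = np.vstack([word_block, object_block])             # 6 features x 8 situations
model = NMF(n_components=3, init="nndsvda", max_iter=1000, random_state=0)
W = model.fit_transform(X)                             # 6 x 3: feature loadings per component

# A component whose strongest word row and strongest object row go together
# is read off as a learned word-object association.
for k in range(W.shape[1]):
    w_idx = int(np.argmax(W[:3, k]))
    o_idx = int(np.argmax(W[3:, k]))
    print(f"component {k}: '{words[w_idx]}' <-> {objects[o_idx]}")

With this block structure, each latent component tends to load on one word row and its correlated object row, which is the cross-situational association being learned; the active-learning variants mentioned above would additionally choose which situation to request next.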
Affective Learning — A Manifesto
Projects a broad perspective on new research in which computer technology is used to redress an imbalance that was caused, or at least accentuated, by the computer itself.
Language Evolution and Robotics: Issues on Symbol Grounding and Language Acquisition
One of the key aspects that distinguish humans from other species is that humans use a complex communication system that is, among other things, symbolic, learnt, compositional, and recursive, whereas…
Developmental and Evolutionary Lexicon Acquisition in Cognitive Agents/Robots with Grounding Principle: A Short Review
This review surveys the methodologies and the relevant computational cognitive-agent and robotic models, and highlights the advantages of and progress made by these approaches on the language-grounding problem.

References

Showing 1-10 of 23 references
Learning words from sights and sounds: a computational model
The model successfully performed speech segmentation, word discovery, and visual categorization from spontaneous infant-directed speech paired with video images of single objects, demonstrating that state-of-the-art techniques from sensory pattern recognition and machine learning can implement cognitive models which process raw sensor data without the need for human transcription or labeling.
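The model summarized above discovers word candidates as recurring acoustic patterns whose recurrence is reinforced by a shared visual context. Purely as a simplified illustration of the recurrence step (assuming phone-level transcriptions rather than the raw audio the model actually processes, and ignoring the visual filtering), candidate word-like units can be found by counting phone n-grams that repeat across utterances:

from collections import Counter

def recurring_ngrams(utterances, n_min=3, n_max=6, min_utterances=2):
    """Return phone n-grams that recur across utterances, as crude word candidates.

    utterances -- list of phone sequences, e.g. [["dh", "ah", "b", "ao", "l"], ...]
    """
    counts = Counter()
    for phones in utterances:
        seen = set()
        for n in range(n_min, n_max + 1):
            for i in range(len(phones) - n + 1):
                seen.add(tuple(phones[i:i + n]))
        counts.update(seen)                      # count each n-gram once per utterance
    return [(g, c) for g, c in counts.most_common() if c >= min_utterances]

# Toy usage with hand-written phone strings for "look at the ball", "the ball",
# and "where's the ball":
utterances = [
    ["l", "uh", "k", "ae", "t", "dh", "ah", "b", "ao", "l"],
    ["dh", "ah", "b", "ao", "l"],
    ["w", "eh", "r", "z", "dh", "ah", "b", "ao", "l"],
]
for gram, count in recurring_ngrams(utterances)[:5]:
    print(count, "-".join(gram))

Here the sequence "dh ah b ao l" recurs in every utterance and surfaces as a candidate; the actual model would additionally require the recurring pattern to co-occur with a stable visual category before treating it as a word.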
Segmentation problems, rhythmic solutions
The lexicon contains discrete entries, which must be located in speech input in order for speech to be understood; but the continuity of speech signals means that lexical access from spoken…
Unsupervised language acquisition
The thesis introduces a variety of technical innovations, among them a common representation for evidence and grammars that has many linguistically and statistically desirable properties, and a learning strategy that separates the "content" of linguistic parameters from their representation.
Integration of speech and vision using mutual information
  • D. Roy
  • 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2000
A system that learns words from co-occurring spoken and visual input, automatically segmenting continuous speech at word boundaries without a lexicon and forming visual categories that correspond to spoken words.
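The reference above scores candidate pairings of a spoken word and a visual category by the mutual information of their co-occurrence. A minimal sketch of that scoring step, assuming word and category detections have already been reduced to binary per-episode indicators (the data and function names below are illustrative, not the paper's implementation):

import math
from itertools import product

def mutual_information_bits(word_present, object_present):
    """Mutual information (in bits) between two binary per-episode indicator lists."""
    n = len(word_present)
    assert n == len(object_present) and n > 0
    mi = 0.0
    for w, v in product((0, 1), repeat=2):
        p_wv = sum(a == w and b == v for a, b in zip(word_present, object_present)) / n
        p_w = sum(a == w for a in word_present) / n
        p_v = sum(b == v for b in object_present) / n
        if p_wv > 0:                       # 0 * log 0 is treated as 0
            mi += p_wv * math.log2(p_wv / (p_w * p_v))
    return mi

# Toy episodes: the candidate word mostly occurs when the "ball" category is in view,
# so that pairing should score higher than a pairing with an unrelated category.
word = [1, 1, 0, 1, 0, 1, 0, 0]
ball = [1, 1, 0, 1, 0, 1, 1, 0]
cup  = [1, 0, 1, 0, 1, 0, 1, 0]
print("word vs. ball:", round(mutual_information_bits(word, ball), 3))
print("word vs. cup: ", round(mutual_information_bits(word, cup), 3))

Ranking all word-category pairings by such a score and keeping the highest-scoring ones is one way to select lexical items from co-occurring speech and vision; the cited system operates on learned audio and visual prototypes rather than hand-made binary indicators.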
Transcription and Alignment of the TIMIT Database
Describes the transcription and alignment of the TIMIT database, performed at MIT; the database consists of 6,300 sentences from 630 speakers, representing over 5 hours of speech material recorded by researchers at TI.
Early word meanings: The case of object names
It has been claimed that young children use object names overgenerally and undergenerally because they do not have notions of objects of particular kinds, but rather, complexive notions of…
Foundations of Cyclopean Perception
This classic work on cyclopean perception has influenced a generation of vision researchers, cognitive scientists, and neuroscientists and has inspired artists, designers, and computer graphics…
Toco the toucan: a synthetic character guided by perception, emotion, and story
  • D. Roy
  • SIGGRAPH '97, 1997
This exhibit demonstrates the integration of several key technologies, including behavior-based animation, interactive storytelling, robust computer audition and vision, and affective computing.
The magical number seven plus or minus two: some limits on our capacity for processing information.
The theory provides a yardstick for calibrating stimulus materials and measuring subjects' performance, and its concepts and measures give a quantitative way of getting at some of these questions.
Linguistic experience alters phonetic perception in infants by 6 months of age.
This study of 6-month-old infants from two countries, the United States and Sweden, shows that exposure to a specific language in the first half year of life alters infants' phonetic perception.