Learning words from sights and sounds: a computational model

@article{Roy2002LearningWF,
  title={Learning words from sights and sounds: a computational model},
  author={Deb K. Roy and Alex Pentland},
  journal={Cognitive Science},
  year={2002},
  volume={26},
  pages={113-146}
}
A Computational Model of Word Learning from Multimodal Sensory Input
How do infants segment continuous streams of speech to discover words of their language? Current theories emphasize the role of acoustic evidence in discovering word boundaries (Cutler 1991; Brent
Language Acquisition: The Emergence of Words from Multimodal Input
TLDR
A computational model is discussed that can detect and build word-like representations from multimodal input data; it is inspired by the memory structure assumed to underlie human speech processing.
On a Computational Model for Language Acquisition: Modeling Cross-Speaker Generalisation
TLDR
How internal representations generalize across speakers is investigated in a computational model able to build word-like representations on the basis of multimodal input data without the help of an a priori specified lexicon.
A computational model of language acquisition: focus on word discovery
TLDR
This paper designs and tests a computational model of word discovery, inspired by the memory structure assumed to underlie human speech processing, and shows that a robust word representation can be learned using about 50 acoustic tokens of the word.
A Computational Model of Language Acquisition: the Emergence of Words
TLDR
A computational model is designed and tested that detects and builds word-like representations from sensory input; it is inspired by the memory structure assumed to underlie human cognitive processing.
Spontaneous speech recognition using visual context-aware language models
TLDR
The thesis presents Fuse, a novel situationally-aware multimodal spoken language system that performs speech understanding for visual object selection and yields significant decreases in speech recognition and understanding error rates.
A Computational Acquisition Model for Multimodal Word Categorization
TLDR
This work presents a cognitively inspired multimodal acquisition model, trained from image-caption pairs on naturalistic data using cross-modal self-supervision, that learns word categories and object recognition abilities and exhibits trends reminiscent of those reported in the developmental literature.
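The summary above names the training signal (cross-modal self-supervision over image-caption pairs) but not its exact form, so the following is only a rough sketch under assumptions: an InfoNCE-style contrastive loss, a common choice for this setup, with the image and text encoders left out and `img_emb`/`txt_emb` assumed to be L2-normalized batch embeddings.

```python
# Sketch of cross-modal self-supervision on image-caption pairs.
# ASSUMPTION: an InfoNCE-style contrastive objective; the paper's actual
# objective and architecture are not specified in the summary above.
import numpy as np

def infonce_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    img_emb, txt_emb: (batch, dim) L2-normalized arrays where row i of
    each matrix comes from the same image-caption pair.
    """
    logits = img_emb @ txt_emb.T / temperature   # pairwise similarities

    def xent(l):
        # cross-entropy against the diagonal (the matching pairs)
        l = l - l.max(axis=1, keepdims=True)     # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.diag(logp).mean()

    # average the image->text and text->image directions
    return 0.5 * (xent(logits) + xent(logits.T))
```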
Interactive Learning of Spoken Words and Their Meanings Through an Audio-Visual Interface
TLDR
Experimental results show that the combination of active and unsupervised learning principles enables the machine and the user to adapt to each other, which makes the learning process more efficient.
Developmental Word Grounding Through a Growing Neural Network With a Humanoid Robot
TLDR
An unsupervised approach to integrating speech and visual information, without using any prepared data, enables a humanoid robot, Incremental Knowledge Robot 1 (IKR1), to learn word meanings.
Active and unsupervised learning for spoken word acquisition through a multimodal interface
  • N. Iwahashi
  • Computer Science
    RO-MAN 2004, 13th IEEE International Workshop on Robot and Human Interactive Communication
  • 2004
TLDR
Experimental results show that the method enables a machine and a user to adapt to each other, which makes the learning process more efficient.
...

References

Showing 1-10 of 171 references
The Unsupervised Acquisition of a Lexicon from Continuous Speech
We present an unsupervised learning algorithm that acquires a natural-language lexicon from raw speech. The algorithm is based on the optimal encoding of symbol sequences in an MDL framework, and
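The snippet cuts off before the details, so purely as an illustration of the MDL idea it names: a candidate lexicon is scored by the bits needed to spell out its entries plus the bits needed to encode the corpus as a sequence of those entries, and segmentations that reuse words compress better. (A toy sketch of the principle; de Marcken's actual algorithm also searches over lexicons and uses a richer coding scheme.)

```python
# Toy MDL score for a segmented corpus: lexicon cost + corpus cost.
import math
from collections import Counter

def description_length(segmented_corpus, alphabet_size=26):
    """Bits to spell out the distinct words, plus bits to encode the
    corpus as word tokens under their maximum-likelihood frequencies."""
    counts = Counter(w for utt in segmented_corpus for w in utt)
    total = sum(counts.values())
    lexicon_bits = sum(len(w) * math.log2(alphabet_size) for w in counts)
    corpus_bits = sum(-c * math.log2(c / total) for c in counts.values())
    return lexicon_bits + corpus_bits

# A segmentation that reuses words compresses better than one that doesn't:
good = [["the", "dog"], ["the", "cat"], ["the", "dog"]]
bad = [["thedog"], ["thecat"], ["thedog"]]
assert description_length(good) < description_length(bad)
```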
Learning words from natural audio-visual input
We present a model of early word learning which learns from natural audio and visual input. The model has been successfully implemented to learn words and their audio-visual grounding from camera and
Improving connected letter recognition by lipreading
The authors show how recognition performance in automated speech perception can be significantly improved by additional lipreading, so-called speech-reading. They show this on an extension of a
Learning from multimodal observations
  • D. Roy
  • Computer Science
    2000 IEEE International Conference on Multimedia and Expo (ICME 2000) Proceedings
  • 2000
TLDR
A working system inspired by infant language learning, which learns from untranscribed speech and images, is presented; it explores the idea of learning from unannotated data by leveraging information across multiple modes of input.
Connectionist Speech Recognition: A Hybrid Approach
From the Publisher: Connectionist Speech Recognition: A Hybrid Approach describes the theory and implementation of a method to incorporate neural network approaches into state-of-the-art continuous
Visual Recognition of American Sign Language Using Hidden Markov Models.
TLDR
Using hidden Markov models (HMMs), an unobtrusive single-view camera system is developed that can recognize hand gestures, namely a subset of American Sign Language (ASL), achieving high recognition rates for full-sentence ASL using only visual cues.
Statistical Learning by 8-Month-Old Infants
TLDR
The present study shows that a fundamental task of language acquisition, segmentation of words from fluent speech, can be accomplished by 8-month-old infants based solely on the statistical relationships between neighboring speech sounds.
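The statistical relationship in question is the transitional probability between adjacent syllables, TP(A→B) = P(B|A): transitions inside a word recur reliably, while transitions across a word boundary do not, so dips in TP mark likely boundaries. A toy sketch over a stream built from nonsense words of the kind used in the study (the particular syllable stream here is made up for illustration):

```python
# Transitional probabilities over a continuous syllable stream.
from collections import Counter

def transitional_probs(syllables):
    """P(next | current) for each adjacent syllable pair in the stream."""
    pair_counts = Counter(zip(syllables, syllables[1:]))
    first_counts = Counter(syllables[:-1])
    return {(a, b): c / first_counts[a] for (a, b), c in pair_counts.items()}

# Stream concatenating the "words" bidaku, padoti, golabu in varying order:
stream = "bi da ku pa do ti go la bu bi da ku go la bu pa do ti".split()
tp = transitional_probs(stream)
# Within-word pairs recur reliably; across-word pairs vary, so their TP
# is lower -- a statistical cue to a word boundary.
print(tp[("bi", "da")])  # 1.0  (within the word "bidaku")
print(tp[("ku", "pa")])  # 0.5  (across a word boundary)
```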
Unsupervised language acquisition
TLDR
The thesis introduces a variety of technical innovations, among them a common representation for evidence and grammars that has many linguistically and statistically desirable properties, and a learning strategy that separates the "content" of linguistic parameters from their representation.
An Efficient, Probabilistically Sound Algorithm for Segmentation and Word Discovery
  • M. Brent
  • Computer Science
    Machine Learning
  • 1999
TLDR
Experiments on phonemic transcripts of spontaneous speech by parents to young children suggest that this model-based, unsupervised algorithm for recovering deleted word boundaries in natural-language text is more effective than other proposed algorithms, at least when utterance boundaries are given and the text includes a substantial number of short utterances.
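Brent's actual model (MBDP-1) is an incremental Bayesian learner, so the following is only an illustration of what recovering deleted word boundaries means computationally: a dynamic program that picks the most probable segmentation of an unsegmented string under a fixed unigram word model, with made-up word probabilities.

```python
# Viterbi-style search for the best segmentation under a unigram model.
import math

def segment(text, word_logprob, max_len=8):
    """best[i] = best log-probability of any segmentation of text[:i];
    back[i] = start index of the last word in that segmentation."""
    n = len(text)
    best = [0.0] + [-math.inf] * n
    back = [0] * (n + 1)
    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            w = text[j:i]
            if w in word_logprob and best[j] + word_logprob[w] > best[i]:
                best[i] = best[j] + word_logprob[w]
                back[i] = j
    words, i = [], n
    while i > 0:  # walk the backpointers to recover the words
        words.append(text[back[i]:i])
        i = back[i]
    return words[::-1]

# Hypothetical lexicon probabilities, made up for the example:
lexicon = {"the": math.log(0.4), "dog": math.log(0.3),
           "do": math.log(0.2), "g": math.log(0.1)}
print(segment("thedog", lexicon))  # -> ['the', 'dog']
```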
...