Corpus ID: 16421182

A visual context-aware multimodal system for spoken language processing

@inproceedings{Mukherjee2003AVC,
  title={A visual context-aware multimodal system for spoken language processing},
  author={Niloy J. Mukherjee and Deb K. Roy},
  booktitle={INTERSPEECH},
  year={2003}
}
Recent psycholinguistic experiments show that acoustic and syntactic aspects of online speech processing are influenced by visual context through cross-modal influences. During interpretation of speech, visual context seems to steer speech processing and vice versa. We present a real-time multimodal system motivated by these findings that performs early integration of visual contextual information to recognize the most likely word sequences in spoken language utterances. The system first…
Mental imagery for a conversational robot
  • D. Roy, K. Hsiao, N. Mavridis
  • Computer Science, Medicine
  • IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics)
  • 2004
TLDR
A set of representations and procedures is presented that enables a robotic manipulator to maintain a "mental model" of its physical environment by coupling active vision to physical simulation, providing the basis for situated language comprehension and production.
Grounded Language Modeling for Automatic Speech Recognition of Sports Video
TLDR
Results show that grounded language models improve perplexity and word error rate over text-based language models and, further, support video information retrieval better than human-generated speech transcriptions.
Look at the real world through the images and talk about it
TLDR
The results indicate that learners develop their English speaking skill when their interests, specific intelligence, and the use of the real language are evolved in the English teaching process.
Implementation of audiovisual material in an early sequential bilingual model during the early years
This research arose from the need to consolidate a meaningful bilingual methodology for children from three to five years of age from low socioeconomic backgrounds belonging to the public education…
The use of video resource as reinforcement of english language teaching process at elementary level addressed to professors and administrative staff of San Francisco de Asis University of La Paz city
The following Guided Project was accomplished to suggest the use of video resource as reinforcement of the English Language Teaching Process at elementary level addressed to sixteen professors and…
The Comparative Effect of Using Visual and Auditory Input Enhancement on the Use of Cohesive Devices in the Writing of Iranian EFL Filed-dependent and independent Learners
Writing has been a troublesome skill for Iranian EFL learners as it needs accurate planning and acceptable coherence. The current study aimed at investigating the comparative effect of visual and…
Theory and application of audiovisual materials in the English classroom
Visually grounded models of spoken language: A survey of datasets, architectures and evaluation techniques
TLDR
An overview of the evolution of visually grounded models of spoken language over the last 20 years is provided, which discusses the central research questions addressed, the timeline of developments, and the datasets which enabled much of this work.

References

Integration of visual and linguistic information in spoken language comprehension.
TLDR
To test the effects of relevant visual context on the rapid mental processes that accompany spoken language comprehension, eye movements were recorded with a head-mounted eye-tracking system while subjects followed instructions to manipulate real objects.
A trainable spoken language understanding system for visual object selection
We present a trainable, visually-grounded, spoken language understanding system. The system acquires a grammar and vocabulary from a “show-and-tell” procedure in which visual scenes are paired with…
Linguistically Mediated Visual Search
TLDR
It is found that when a conjunction target was identified by a spoken instruction presented concurrently with the visual display, the incremental processing of spoken language allowed the search process to proceed in a manner considerably less affected by the number of distractors.
Learning visually grounded words and syntax for a scene description task
  • D. Roy
  • Computer Science
  • Comput. Speech Lang.
  • 2002
TLDR
A spoken language generation system is presented that learns to describe objects in computer-generated visual scenes and generates syntactically well-formed compound adjective-noun phrases as well as relative spatial clauses; its output was comparable to human-generated descriptions.
Eye Movements and Lexical Access in Spoken-Language Comprehension: Evaluating a Linking Hypothesis between Fixations and Linguistic Processing
TLDR
The results provide evidence about the time course of lexical activation that resolves some important theoretical issues in spoken-word recognition and demonstrate that fixations are sensitive to properties of the normal language-processing system that cannot be attributed to task-specific strategies.
Grounding spatial language in perception: an empirical and computational investigation.
TLDR
The authors conclude that the structure of linguistic spatial categories can be partially explained in terms of independently motivated perceptual processes.
Two decades of statistical language modeling: where do we go from here?
TLDR
Statistical language models estimate the distribution of various natural language phenomena for the purpose of speech recognition and other language technologies. A Bayesian approach to integration of linguistic theories with data is argued for.
Class-Based n-gram Models of Natural Language
TLDR
This work addresses the problem of predicting a word from previous words in a sample of text and discusses n-gram models based on classes of words, finding that these models are able to extract classes that have the flavor of either syntactically based groupings or semantically based groupings, depending on the nature of the underlying statistics.
Spontaneous speech recognition using hidden Markov models
  • 2001