A visual context-aware multimodal system for spoken language processing

  title={A visual context-aware multimodal system for spoken language processing},
  author={Niloy J. Mukherjee and Deb K. Roy},
  journal={8th European Conference on Speech Communication and Technology (Eurospeech 2003)},
  • Niloy J. Mukherjee, D. Roy
  • Published 1 September 2003
  • Psychology
  • 8th European Conference on Speech Communication and Technology (Eurospeech 2003)
Recent psycholinguistic experiments show that acoustic and syntactic aspects of online speech processing are influenced by visual context through cross-modal influences. During interpretation of speech, visual context seems to steer speech processing and vice versa. We present a real-time multimodal system motivated by these findings that performs early integration of visual contextual information to recognize the most likely word sequences in spoken language utterances. The system first acquires…


Learning English with Peppa Pig

A simple bi-modal architecture, trained on the portion of the data consisting of dialog between characters and evaluated on segments containing descriptive narration, succeeds at learning aspects of the visual semantics of spoken language.

Visually grounded models of spoken language: A survey of datasets, architectures and evaluation techniques

An overview of the evolution of visually grounded models of spoken language over the last 20 years is provided, which discusses the central research questions addressed, the timeline of developments, and the datasets which enabled much of this work.

Improving Multimodal Speech Recognition by Data Augmentation and Speech Representations

  • Dan Oneață, H. Cucu
  • Computer Science
  • 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)
  • 2022
It is shown that starting from a pretrained ASR significantly improves the state-of-the-art performance; remarkably, even when building upon a strong unimodal system, the system still finds gains by including the visual modality.

Mental imagery for a conversational robot

A set of representations and procedures are presented that enable a robotic manipulator to maintain a "mental model" of its physical environment by coupling active vision to physical simulation and providing the basis for situated language comprehension and production.

Grounded Language Modeling for Automatic Speech Recognition of Sports Video

Results show that grounded language models improve perplexity and word error rate over text based language models, and further, support video information retrieval better than human generated speech transcriptions.

Look at the real world through the images and talk about it

The results indicate that learners develop their English speaking skill when their interests, specific intelligence, and the use of real language are involved in the English teaching process.

Implementation of audiovisual material in an early sequential bilingual model during the early years

This research arose from the need to consolidate a meaningful bilingual methodology for children from three to five years of age from low socioeconomic backgrounds belonging to the public education

The use of video resource as reinforcement of english language teaching process at elementary level addressed to professors and administrative staff of San Francisco de Asis University of La Paz city

The following Guided Project was accomplished to suggest the use of video resource as reinforcement of the English Language Teaching Process at elementary level addressed to sixteen professors and

The Comparative Effect of Using Visual and Auditory Input Enhancement on the Use of Cohesive Devices in the Writing of Iranian EFL Filed-dependent and independent Learners

Writing has been a troublesome skill for Iranian EFL learners as it needs accurate planning and acceptable coherence. The current study aimed at investigating the comparative effect of visual and

Theory and application of audiovisual materials in the English classroom




Integration of visual and linguistic information in spoken language comprehension.

To test the effects of relevant visual context on the rapid mental processes that accompany spoken language comprehension, eye movements were recorded with a head-mounted eye-tracking system while subjects followed instructions to manipulate real objects.

A trainable spoken language understanding system for visual object selection

We present a trainable, visually-grounded, spoken language understanding system. The system acquires a grammar and vocabulary from a “show-and-tell” procedure in which visual scenes are paired with

Linguistically Mediated Visual Search

It is found that when a conjunction target was identified by a spoken instruction presented concurrently with the visual display, the incremental processing of spoken language allowed the search process to proceed in a manner considerably less affected by the number of distractors.

Learning visually grounded words and syntax for a scene description task

  • D. Roy
  • Computer Science, Linguistics
  • Comput. Speech Lang.
  • 2002

Eye Movements and Lexical Access in Spoken-Language Comprehension: Evaluating a Linking Hypothesis between Fixations and Linguistic Processing

The results provide evidence about the time course of lexical activation that resolves some important theoretical issues in spoken-word recognition and demonstrate that fixations are sensitive to properties of the normal language-processing system that cannot be attributed to task-specific strategies.

Grounding spatial language in perception: an empirical and computational investigation.

The authors conclude that the structure of linguistic spatial categories can be partially explained in terms of independently motivated perceptual processes.

Class-Based n-gram Models of Natural Language

This work addresses the problem of predicting a word from previous words in a sample of text and discusses n-gram models based on classes of words, finding that these models are able to extract classes that have the flavor of either syntactically based groupings or semantically based groupings, depending on the nature of the underlying statistics.

Improved clustering techniques for class-based statistical language modeling

Two decades of statistical language modeling: where do we go from here?

A Bayesian approach to integration of linguistic theories with data is argued for. Statistical language models estimate the distribution of various natural language phenomena for the purpose of speech recognition and other language technologies.