Fine-Grained Grounding for Multimodal Speech Recognition

Tejas Srinivasan, Ramon Sanabria, Florian Metze, Desmond Elliott
Multimodal automatic speech recognition systems integrate information from images to improve speech recognition quality by grounding the speech in the visual context. While visual signals have been shown to be useful for recovering entities that have been masked in the audio, these models should be capable of recovering a broader range of word types. Existing systems rely on global visual features that represent the entire image, but localizing the relevant regions of the image will make it…
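The core idea in the abstract — attending over localized region features rather than a single global image vector — can be sketched as follows. This is a minimal illustration with made-up dimensions, not the authors' implementation:

```python
import numpy as np

def attend_regions(decoder_state, region_feats):
    """Soft attention over object-region features.

    decoder_state: (d,) current ASR decoder hidden state
    region_feats:  (k, d) one feature vector per detected image region
    Returns a (d,) visual context vector: a convex combination of regions.
    """
    scores = region_feats @ decoder_state          # (k,) dot-product scores
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                       # softmax over regions
    return weights @ region_feats                  # (d,) weighted sum

# A global-feature baseline would instead use region_feats.mean(axis=0)
# regardless of the decoder state.
rng = np.random.default_rng(0)
ctx = attend_regions(rng.normal(size=8), rng.normal(size=(5, 8)))
```

The context vector changes with the decoder state, so different words can attend to different image regions.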

Improving Multimodal Speech Recognition by Data Augmentation and Speech Representations

  • Dan Oneață, H. Cucu
  • Computer Science
    2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)
  • 2022
It is shown that starting from a pretrained ASR significantly improves the state-of-the-art performance; remarkably, even when building upon a strong unimodal system, the system still finds gains by including the visual modality.

Modelling word learning and recognition using visually grounded speech

The experiments show that the model is able to recognise nouns in isolation and even learns to properly differentiate between plural and singular nouns and that recognition is influenced by word competition from the word-initial cohort and neighbourhood density, mirroring word competition effects in human speech comprehension.

Modelling Human Word Learning and Recognition Using Visually Grounded Speech

The experiments show that the LSTM-VQ model is able to recognise nouns in isolation and even learns to properly differentiate between plural and singular nouns, and finds that recognition is influenced by word competition from the word-initial cohort and neighbourhood density, mirroring word competition effects in human speech comprehension.

Multimodal Speech Recognition for Language-Guided Embodied Agents

This work proposes training a multimodal ASR model to reduce errors in transcribing spoken instructions by considering the accompanying visual context, and finds that utilizing visual observations facilitates masked word recovery, with multimodal ASR models recovering up to 30% more masked words than unimodal baselines.

Grounding ‘Grounding’ in NLP

This work investigates the gap between definitions of “grounding” in NLP and Cognitive Science, and presents ways to both create new tasks and repurpose existing ones to make progress toward a more complete sense of grounding.

Multimodal Grounding for Sequence-to-sequence Speech Recognition

This paper proposes novel end-to-end multimodal ASR systems and compares them to the adaptive approach by using a range of visual representations obtained from state-of-the-art convolutional neural networks and shows that adaptive training is effective for S2S models leading to an absolute improvement of 1.4% in word error rate.
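The 1.4% absolute improvement reported above is in word error rate (WER), the length-normalized word-level edit distance between hypothesis and reference. A minimal implementation of the standard metric:

```python
def wer(ref, hyp):
    """Word error rate: (substitutions + insertions + deletions) / len(ref)."""
    r, h = ref.split(), hyp.split()
    # Standard Levenshtein dynamic program over words.
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)
```

Because WER is normalized by reference length, a 1.4% absolute gain means 1.4 fewer errors per 100 reference words.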

Image-Sensitive Language Modeling for Automatic Speech Recognition

The benefits of introducing the visual modality as context information for automatic speech recognition are explored, using neural multimodal language models to rescore the recognition results of utterances that describe visual scenes.

Looking Enhances Listening: Recovering Missing Speech Using Images

This paper shows that end-to-end multimodal ASR systems can become more robust to noise by leveraging the visual context, and observes that integrating visual context can result in up to 35% relative improvement in masked word recovery.
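The masked word recovery metric used in this line of work can be scored simply: mask the audio for selected reference words, decode, and count how many masked positions the system transcribes correctly. A hypothetical scoring helper (the function name and the position-aligned assumption are illustrative, not from the paper):

```python
def recovery_rate(ref_words, hyp_words, masked_idx):
    """Fraction of masked reference words the hypothesis got right.

    Assumes hyp_words is position-aligned with ref_words, which is a
    simplification; a real evaluation would align hypothesis to reference.
    """
    if not masked_idx:
        return 0.0
    hits = sum(1 for i in masked_idx
               if i < len(hyp_words) and hyp_words[i] == ref_words[i])
    return hits / len(masked_idx)

rate = recovery_rate("a dog on the beach".split(),
                     "a dog on a beach".split(), [1, 3])
# one of the two masked words ("dog") recovered
```

A "35% relative improvement" then means the multimodal model's recovery rate is 1.35x the unimodal baseline's.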

Multimodal machine translation through visuals and speech

The paper concludes with a discussion of directions for future research in multimodal machine translation: the need for more expansive and challenging datasets, for targeted evaluations of model performance, and for multimodality in both the input and output space.

Large-Scale Representation Learning from Visually Grounded Untranscribed Speech

A scalable method to automatically generate diverse audio for image captioning datasets via a dual encoder that learns to align latent representations from both modalities is described and it is shown that a masked margin softmax loss for such models is superior to the standard triplet loss.
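The dual-encoder objective described above scores all audio/image pairs in a batch and treats the diagonal as positives. A sketch of a margin softmax loss in that style follows; the "masked" part of the paper's loss (down-weighting likely-false negatives in the batch) is omitted here, and the margin value is illustrative:

```python
import numpy as np

def margin_softmax_loss(audio_emb, image_emb, margin=0.1):
    """In-batch contrastive loss for a dual encoder (simplified sketch).

    Row i of each (n, d) matrix is a matched audio/image pair. A margin is
    subtracted from the positive (diagonal) similarities, which tightens the
    objective relative to plain softmax cross-entropy.
    """
    sims = audio_emb @ image_emb.T                 # (n, n) similarity matrix
    n = sims.shape[0]
    sims = sims - margin * np.eye(n)               # margin on positives only

    def xent(logits):
        # Cross-entropy with the diagonal (matched pair) as the target.
        logits = logits - logits.max(axis=1, keepdims=True)
        logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logp))

    # Symmetric: audio->image and image->audio retrieval directions.
    return 0.5 * (xent(sims) + xent(sims.T))

loss = margin_softmax_loss(np.eye(4), np.eye(4))
```

Unlike a triplet loss, which compares one negative per positive, this objective uses every other item in the batch as a negative, which is the property the paper finds superior.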

Open-Domain Audio-Visual Speech Recognition: A Deep Learning Approach

On a challenging video transcribing task, audio-visual ASR using the proposed approach gets notable improvements in terms of word error rates (WERs), compared to ASR merely using speech features.

LSTM Language Model Adaptation with Images and Titles for Multimedia Automatic Speech Recognition

This investigation extends existing ASR methods by using images and video titles to adapt a recurrent neural network (RNN) language model with a long short-term memory (LSTM) network, finding that both image adaptation and video title adaptation give the model more confidence in the choice of contextually correct informative words.

End-to-end Multimodal Speech Recognition

This paper analyzes the behavior of CTC and S2S models on noisy video data (How-To corpus), and compares it to results on the clean Wall Street Journal corpus, providing insight into the robustness of both approaches.

Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models

This paper presents Flickr30K Entities, which augments the 158k captions from Flickr30k with 244k coreference chains, linking mentions of the same entities across different captions for the same image, and associating them with 276k manually annotated bounding boxes.

Grounded Sequence to Sequence Transduction

This article describes the How2 dataset, a large, open-domain collection of videos with transcriptions and their translations, shows how this single dataset can be used to develop systems for a variety of language tasks, and presents a number of models meant as starting points.