LSTM Language Model Adaptation with Images and Titles for Multimedia Automatic Speech Recognition

  • Yasufumi Moriya, G. Jones
  • Published 1 December 2018
  • Computer Science
  • 2018 IEEE Spoken Language Technology Workshop (SLT)
Transcription of multimedia data sources is often a challenging automatic speech recognition (ASR) task. […] Our language model is tested on transcription of an existing corpus of instruction videos and on a new corpus consisting of lecture videos. A consistent reduction in perplexity of 5–10 is observed on both datasets. When the non-adapted model is combined with the image adaptation and video title adaptation models for n-best ASR hypothesis re-ranking, additionally the word error rate (WER) is…
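
The n-best re-ranking step mentioned in the abstract can be illustrated with a minimal sketch. The excerpt does not say how the paper combines the non-adapted, image-adapted and title-adapted models, so this sketch assumes a weighted sum of their log-probabilities added to an acoustic score. The function name `rerank_nbest`, the dictionary layout, the weights and all scores are hypothetical, chosen only for illustration.

```python
def rerank_nbest(hypotheses, weights):
    """Return hypotheses sorted by combined score, best first.

    Each hypothesis is a dict with an acoustic log score under "am" and
    one log-probability per language model under "lm", keyed by model name.
    Only the models named in `weights` contribute to the score.
    """
    def combined(hyp):
        # Assumed combination: weighted sum of language model log-probabilities.
        lm_score = sum(w * hyp["lm"][name] for name, w in weights.items())
        return hyp["am"] + lm_score

    return sorted(hypotheses, key=combined, reverse=True)


# Hypothetical 2-best list; every number here is made up.
nbest = [
    {"text": "whisk the eggs", "am": -12.0,
     "lm": {"base": -9.0, "image": -7.0, "title": -8.0}},
    {"text": "risk the eggs", "am": -11.5,
     "lm": {"base": -9.0, "image": -11.0, "title": -10.0}},
]

# Assumed interpolation weights; in practice these would be tuned on held-out data.
weights = {"base": 0.5, "image": 0.25, "title": 0.25}

baseline_best = rerank_nbest(nbest, {"base": 1.0})[0]["text"]
adapted_best = rerank_nbest(nbest, weights)[0]["text"]
print(baseline_best)
print(adapted_best)
```

With these made-up scores, the non-adapted model alone prefers the acoustically stronger hypothesis, while adding the two adapted models moves the contextually plausible one to the top. This is the kind of effect re-ranking is meant to capture, not a reproduction of the paper's results.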

Multimodal Grounding for Sequence-to-sequence Speech Recognition

This paper proposes novel end-to-end multimodal ASR systems and compares them to the adaptive approach, using a range of visual representations obtained from state-of-the-art convolutional neural networks. It shows that adaptive training is effective for S2S models, leading to an absolute improvement of 1.4% in word error rate.

Improving Multimodal Speech Recognition by Data Augmentation and Speech Representations

  • Dan Oneaţă, H. Cucu
  • Computer Science
    2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)
  • 2022
It is shown that starting from a pretrained ASR significantly improves the state-of-the-art performance; remarkably, even when building upon a strong unimodal system, the system still finds gains by including the visual modality.

Multimodal Speaker Adaptation of Acoustic Model and Language Model for ASR Using Speaker Face Embedding

An experimental investigation shows a small improvement in word error rate for the transcription of a collection of instruction videos using adaptation of the acoustic model and the language model with fixed-length face embedding vectors.

Fine-Grained Grounding for Multimodal Speech Recognition

This paper proposes a model that uses finer-grained visual information from different parts of the image, obtained from automatic object proposals. It finds that the model improves over approaches that use global visual features, and that the proposals enable the model to recover entities and other related words, such as adjectives.

STEPs-RL: Speech-Text Entanglement for Phonetically Sound Representation Learning

A novel multi-modal deep neural network architecture that uses speech and text entanglement for learning phonetically sound spoken-word representations, which makes this work the first of its kind.

CUE Vectors: Modular Training of Language Models Conditioned on Diverse Contextual Signals

A modular framework that allows incremental, scalable training of context-enhanced LMs, and can swap one type of pretrained sentence LM for another without retraining the context encoders, by only adapting the decoder model.

AVATAR: Unconstrained Audiovisual Speech Recognition

A new sequence-to-sequence AudioVisual ASR TrAnsformeR (AVATAR) is proposed, which is trained end-to-end from spectrograms and full-frame RGB. The paper demonstrates the contribution of the visual modality on the How2 AV-ASR benchmark and shows that the model outperforms all prior work by a large margin.

Looking Enhances Listening: Recovering Missing Speech Using Images

This paper shows that end-to-end multimodal ASR systems can become more robust to noise by leveraging the visual context, and observes that integrating visual context can result in up to 35% relative improvement in masked word recovery.

Analyzing Utility of Visual Context in Multimodal Speech Recognition Under Noisy Conditions

This work examines the utility of the auxiliary visual context in multimodal automatic speech recognition in adversarial settings. It shows that current methods of integrating the visual modality do not improve model robustness to noise, and that better visually grounded adaptation techniques are needed.

Eyes and Ears Together: New Task for Multimodal Spoken Content Analysis

Eyes and Ears Together proposes two benchmark multimodal speech processing tasks: (1) multimodal automatic speech recognition (ASR) and (2) multi-reference co-reference resolution on spoken multimedia.



Large scale deep neural network acoustic modeling with semi-supervised training data for YouTube video transcription

Recent improvements to the original YouTube automatic generation of closed captions system are described, in particular the use of owner-uploaded video transcripts to generate additional semi-supervised training data and deep neural networks acoustic models with large state inventories.

End-to-end Multimodal Speech Recognition

This paper analyzes the behavior of CTC and S2S models on noisy video data (How-To corpus), and compares it to results on the clean Wall Street Journal corpus, providing insight into the robustness of both approaches.

Knowing When to Look: Adaptive Attention via a Visual Sentinel for Image Captioning

This paper proposes a novel adaptive attention model with a visual sentinel that sets the new state-of-the-art by a significant margin on image captioning.

Spoken Content Retrieval—Beyond Cascading Speech Recognition with Text Retrieval

This article is intended to provide a thorough overview of the concepts, principles, approaches, and achievements of major technical contributions along this line of investigation.

Towards Universal Paraphrastic Sentence Embeddings

This work considers the problem of learning general-purpose, paraphrastic sentence embeddings based on supervision from the Paraphrase Database, and compares six compositional architectures, finding that the most complex architectures, such as long short-term memory (LSTM) recurrent neural networks, perform best on the in-domain data.

Spoken Content Retrieval: A Survey of Techniques and Technologies

This survey provides an overview of the field of SCR, encompassing component technologies, the relationship of SCR to text IR and automatic speech recognition, and user interaction issues. It is aimed at researchers with backgrounds in speech technology or IR who are seeking deeper insight into how these fields are integrated to support research and development.

A Simple but Tough-to-Beat Baseline for Sentence Embeddings

The MGB challenge: Evaluating multi-genre broadcast media recognition

  • P. Bell, M. Gales, P. Woodland
  • Computer Science
    2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU)
  • 2015
An evaluation focused on speech recognition, speaker diarization, and "lightly supervised" alignment of BBC TV recordings at ASRU 2015 is described, and the results obtained are summarized.

Recurrent neural network based language model

Results indicate that it is possible to obtain around a 50% reduction in perplexity by using a mixture of several RNN LMs, compared to a state-of-the-art backoff language model.

Look, listen, and decode: Multimodal speech recognition with images

A lattice rescoring algorithm is investigated that integrates information from the image at two different points: the image is used to augment the language model with the most likely words, and to rescore the top hypotheses using a word-level RNN.