LSTM Language Model Adaptation with Images and Titles for Multimedia Automatic Speech Recognition

@article{Moriya2018LSTMLM,
  title={LSTM Language Model Adaptation with Images and Titles for Multimedia Automatic Speech Recognition},
  author={Yasufumi Moriya and G. Jones},
  journal={2018 IEEE Spoken Language Technology Workshop (SLT)},
  year={2018}
}
  • Published 1 December 2018
Transcription of multimedia data sources is often a challenging automatic speech recognition (ASR) task. […] Our language model is tested on transcription of an existing corpus of instruction videos and on a new corpus consisting of lecture videos. A consistent perplexity reduction of 5–10 points is observed on both datasets. When the non-adapted model is combined with the image-adaptation and video-title-adaptation models for n-best ASR hypothesis re-ranking, the word error rate (WER) is additionally…
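The n-best re-ranking step mentioned in the abstract can be sketched as follows. This is a minimal illustration of generic hypothesis re-scoring, not the paper's exact formulation; the function name, score fields, and interpolation weight are all assumptions.

```python
def rerank_nbest(nbest, lm_weight=0.5):
    """Pick the best ASR hypothesis by combining the recognizer's own
    score with an adapted language-model log-probability.

    nbest: list of (hypothesis, asr_logprob, adapted_lm_logprob) tuples.
    lm_weight: illustrative interpolation weight for the adapted LM.
    """
    def combined(entry):
        _, asr_lp, lm_lp = entry
        return asr_lp + lm_weight * lm_lp
    return max(nbest, key=combined)[0]

# Toy example: the adapted LM strongly prefers the hypothesis that
# matches the video's topic, overriding a slightly worse ASR score.
nbest = [
    ("whisk the eggs until fluffy", -12.0, -8.0),
    ("risk the legs until fluffy",  -11.5, -15.0),
]
best = rerank_nbest(nbest)  # -> "whisk the eggs until fluffy"
```

In practice the interpolation weight would be tuned on a development set rather than fixed.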

Figures and Tables from this paper

Citations
Multimodal Grounding for Sequence-to-sequence Speech Recognition
This paper proposes novel end-to-end multimodal ASR systems and compares them to the adaptive approach using a range of visual representations obtained from state-of-the-art convolutional neural networks, showing that adaptive training is effective for S2S models, leading to an absolute improvement of 1.4% in word error rate.
Listen, Look and Deliberate: Visual Context-Aware Speech Recognition Using Pre-Trained Text-Video Representations
Novel VC-ASR approaches that leverage video and text representations extracted by a self-supervised pre-trained text-video embedding model are explored, and a multi-stream attention architecture that leverages signals from both audio and video modalities is proposed.
Improving Multimodal Speech Recognition by Data Augmentation and Speech Representations
It is shown that starting from a pretrained ASR model significantly improves on the state-of-the-art performance; remarkably, even when building upon a strong unimodal system, gains are still found by including the visual modality.
Multimodal Speaker Adaptation of Acoustic Model and Language Model for Asr Using Speaker Face Embedding
An experimental investigation shows a small improvement in word error rate for the transcription of a collection of instruction videos using adaptation of the acoustic model and the language model with fixed-length face embedding vectors.
Fine-Grained Grounding for Multimodal Speech Recognition
This paper proposes a model that uses finer-grained visual information from different parts of the image via automatic object proposals, and finds that it improves over approaches using global visual features and that the proposals enable the model to recover entities and other related words, such as adjectives.
STEPs-RL: Speech-Text Entanglement for Phonetically Sound Representation Learning
A novel multi-modal deep neural network architecture that uses speech and text entanglement for learning phonetically sound spoken-word representations, making this work the first of its kind.
CUE Vectors: Modular Training of Language Models Conditioned on Diverse Contextual Signals
A modular framework that allows incremental, scalable training of context-enhanced LMs and can swap one type of pretrained sentence LM for another without retraining the context encoders, adapting only the decoder model.
AVATAR: Unconstrained Audiovisual Speech Recognition
This work proposes a new sequence-to-sequence AudioVisual ASR TrAnsformeR (AVATAR) which is trained end-to-end from spectrograms and full-frame RGB, and demonstrates the contribution of the visual modality on the How2 AV-ASR benchmark, especially in the presence of simulated noise.
Looking Enhances Listening: Recovering Missing Speech Using Images
This paper shows that end-to-end multimodal ASR systems can become more robust to noise by leveraging the visual context, and observes that integrating visual context can result in up to 35% relative improvement in masked word recovery.
Analyzing Utility of Visual Context in Multimodal Speech Recognition Under Noisy Conditions
This work examines the utility of auxiliary visual context in multimodal automatic speech recognition in adversarial settings, and shows that current methods of integrating the visual modality do not improve model robustness to noise, indicating the need for better visually grounded adaptation techniques.

References

Large scale deep neural network acoustic modeling with semi-supervised training data for YouTube video transcription
Recent improvements to the original YouTube automatic closed-captioning system are described, in particular the use of owner-uploaded video transcripts to generate additional semi-supervised training data, and deep neural network acoustic models with large state inventories.
End-to-end Multimodal Speech Recognition
This paper analyzes the behavior of CTC and S2S models on noisy video data (How-To corpus), and compares it to results on the clean Wall Street Journal corpus, providing insight into the robustness of both approaches.
Show and tell: A neural image caption generator
This paper presents a generative model based on a deep recurrent architecture that combines recent advances in computer vision and machine translation and that can be used to generate natural sentences describing an image.
CUED-RNNLM — An open-source toolkit for efficient training and evaluation of recurrent neural network language models
An open-source toolkit that supports efficient GPU-based training of RNNLMs, including training with a large number of word-level output targets, in contrast to existing tools that use class-based output targets.
Knowing When to Look: Adaptive Attention via a Visual Sentinel for Image Captioning
This paper proposes a novel adaptive attention model with a visual sentinel that sets the new state-of-the-art by a significant margin on image captioning.
Spoken Content Retrieval—Beyond Cascading Speech Recognition with Text Retrieval
This overview article provides a thorough account of the concepts, principles, approaches, and achievements of major technical contributions along this line of investigation.
Towards Universal Paraphrastic Sentence Embeddings
This work considers the problem of learning general-purpose, paraphrastic sentence embeddings based on supervision from the Paraphrase Database, and compares six compositional architectures, finding that the most complex architectures, such as long short-term memory (LSTM) recurrent neural networks, perform best on the in-domain data.
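The simplest compositional architecture in that comparison is plain word-vector averaging. A minimal sketch with toy two-dimensional vectors follows; the vocabulary and embedding values are illustrative, not trained.

```python
def average_embedding(sentence, word_vectors):
    """Sentence embedding as the mean of the word vectors of its
    in-vocabulary tokens (toy values; real systems use trained vectors)."""
    vecs = [word_vectors[w] for w in sentence.split() if w in word_vectors]
    dim = len(next(iter(word_vectors.values())))
    if not vecs:
        return [0.0] * dim  # no known words: zero vector
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

word_vectors = {"cats": [1.0, 0.0], "purr": [0.0, 1.0]}
emb = average_embedding("cats purr", word_vectors)  # -> [0.5, 0.5]
```

Despite its simplicity, this kind of averaging is the baseline that more complex architectures such as LSTMs are measured against in the work above.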
Spoken Content Retrieval: A Survey of Techniques and Technologies
This survey provides an overview of the field of spoken content retrieval (SCR), encompassing component technologies, the relationship of SCR to text IR and automatic speech recognition, and user-interaction issues; it is aimed at researchers with backgrounds in speech technology or IR seeking deeper insight into how these fields are integrated to support research and development.
A Simple but Tough-to-Beat Baseline for Sentence Embeddings
Recurrent neural network based language model
Results indicate that it is possible to obtain around a 50% reduction in perplexity by using a mixture of several RNN LMs, compared to a state-of-the-art backoff language model.
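Such a mixture is typically a linear interpolation of the models' per-word probabilities, and perplexity is the exponentiated negative mean log-probability over the test words. A minimal sketch with made-up probabilities (the weights and values are illustrative only):

```python
import math

def mixture_perplexity(per_word_probs, weights):
    """Perplexity of a linear interpolation of several language models.

    per_word_probs: one list of per-word probabilities per model,
                    aligned over the same test words (toy values here).
    weights: interpolation weights, assumed to sum to 1.
    """
    n_words = len(per_word_probs[0])
    log_sum = 0.0
    for i in range(n_words):
        # Interpolated probability of word i under the mixture.
        p = sum(w * probs[i] for w, probs in zip(weights, per_word_probs))
        log_sum += math.log(p)
    return math.exp(-log_sum / n_words)

# Toy: interpolating two models with weights 0.7 / 0.3.
rnn_probs = [0.2, 0.1, 0.25]
backoff_probs = [0.05, 0.15, 0.1]
ppl = mixture_perplexity([rnn_probs, backoff_probs], [0.7, 0.3])
```

Lower perplexity means the mixture assigns higher probability to the held-out words; in practice the interpolation weights are tuned on development data.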