LSTM Language Model Adaptation with Images and Titles for Multimedia Automatic Speech Recognition
@article{Moriya2018LSTMLM,
  title={LSTM Language Model Adaptation with Images and Titles for Multimedia Automatic Speech Recognition},
  author={Yasufumi Moriya and G. Jones},
  journal={2018 IEEE Spoken Language Technology Workshop (SLT)},
  year={2018},
  pages={219-226}
}
Transcription of multimedia data sources is often a challenging automatic speech recognition (ASR) task. […] Our language model is tested on transcription of an existing corpus of instruction videos and on a new corpus consisting of lecture videos. A consistent reduction in perplexity of 5–10 points is observed on both datasets. When the non-adapted model is combined with the image-adaptation and video-title-adaptation models for n-best ASR hypothesis re-ranking, the word error rate (WER) is additionally…
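As a rough illustration of the adaptation scheme described in the abstract, the sketch below conditions an LSTM LM on a fixed multimedia context vector (an image feature or a title embedding) by concatenating a projection of it to each word embedding. This is a minimal sketch under stated assumptions: all class names, layer choices, and dimensions are illustrative, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class AdaptedLSTMLM(nn.Module):
    """LSTM language model conditioned on a fixed context vector
    (e.g. a CNN image feature or a title embedding). Hypothetical sketch."""

    def __init__(self, vocab_size, emb_dim=256, ctx_dim=2048, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.ctx_proj = nn.Linear(ctx_dim, emb_dim)  # project image/title feature
        self.lstm = nn.LSTM(2 * emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens, ctx):
        # tokens: (B, T) word ids; ctx: (B, ctx_dim) context vector
        w = self.embed(tokens)                            # (B, T, E)
        c = self.ctx_proj(ctx).unsqueeze(1).expand_as(w)  # repeat per time step
        h, _ = self.lstm(torch.cat([w, c], dim=-1))
        return self.out(h)                                # next-word logits

# toy usage: score a hypothesis under the adapted model
model = AdaptedLSTMLM(vocab_size=1000)
tokens = torch.randint(0, 1000, (1, 12))
image_feat = torch.randn(1, 2048)  # e.g. a CNN pooling-layer feature
logits = model(tokens, image_feat)
```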
14 Citations
Multimodal Grounding for Sequence-to-sequence Speech Recognition
- Computer ScienceICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2019
This paper proposes novel end-to-end multimodal ASR systems and compares them to the adaptive approach using a range of visual representations obtained from state-of-the-art convolutional neural networks, showing that adaptive training is effective for S2S models and leads to an absolute improvement of 1.4% in word error rate.
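One common way to realize the kind of visual integration this summary mentions is to extract a global CNN feature and use a learned projection of it to initialize the decoder state. The sketch below assumes a torchvision ResNet-50 as the feature extractor; the integration point and dimensions are assumptions, not necessarily the paper's exact scheme.

```python
import torch
import torch.nn as nn
import torchvision.models as models

# Global visual feature from a pretrained CNN (one possible choice among
# the "state-of-the-art" representations such papers compare).
cnn = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
cnn.fc = nn.Identity()  # keep the 2048-d pooled feature
cnn.eval()

with torch.no_grad():
    frame = torch.randn(1, 3, 224, 224)  # a video frame, normalized upstream
    visual = cnn(frame)                  # (1, 2048)

# Ground the decoder by shifting its initial hidden state with a learned
# projection of the visual feature (a common grounding scheme).
proj = nn.Linear(2048, 512)
h0 = torch.tanh(proj(visual)).unsqueeze(0)  # (1, 1, 512) for a 1-layer GRU
decoder = nn.GRU(256, 512, batch_first=True)
prev_emb = torch.randn(1, 10, 256)          # embedded previous tokens
out, _ = decoder(prev_emb, h0)
```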
Improving Multimodal Speech Recognition by Data Augmentation and Speech Representations
- Computer Science2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)
- 2022
It is shown that starting from a pretrained ASR model significantly improves on state-of-the-art performance; remarkably, even when building upon a strong unimodal system, further gains are obtained by including the visual modality.
Multimodal Speaker Adaptation of Acoustic Model and Language Model for Asr Using Speaker Face Embedding
- Computer ScienceICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2019
An experimental investigation shows a small improvement in word error rate for the transcription of a collection of instruction videos using adaptation of the acoustic model and the language model with fixed-length face embedding vectors.
Fine-Grained Grounding for Multimodal Speech Recognition
- Computer ScienceFINDINGS
- 2020
This paper proposes a model that uses finer-grained visual information from different parts of the image, obtained via automatic object proposals, and finds that the model improves over approaches that use global visual features and that the proposals enable the model to recover entities and related words, such as adjectives.
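The shift from a global image vector to object proposals amounts to attending over a set of per-region features. Below is a minimal sketch of such region-level attention; the module name, projections, and dimensions are illustrative assumptions rather than the paper's implementation.

```python
import torch
import torch.nn as nn

class RegionAttention(nn.Module):
    """Attend over per-region features from object proposals instead of a
    single global image vector (schematic sketch)."""

    def __init__(self, region_dim=2048, query_dim=512, attn_dim=256):
        super().__init__()
        self.wr = nn.Linear(region_dim, attn_dim)
        self.wq = nn.Linear(query_dim, attn_dim)
        self.v = nn.Linear(attn_dim, 1)

    def forward(self, regions, query):
        # regions: (B, R, region_dim) proposal features; query: (B, query_dim)
        scores = self.v(torch.tanh(self.wr(regions) + self.wq(query).unsqueeze(1)))
        alpha = torch.softmax(scores, dim=1)  # (B, R, 1) weights over proposals
        return (alpha * regions).sum(dim=1)   # (B, region_dim) grounded context

attn = RegionAttention()
regions = torch.randn(2, 36, 2048)  # e.g. 36 proposals per image
query = torch.randn(2, 512)         # current decoder state
context = attn(regions, query)
```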
STEPs-RL: Speech-Text Entanglement for Phonetically Sound Representation Learning
- Computer SciencePAKDD
- 2021
A novel multi-modal deep neural network architecture that uses speech and text entanglement for learning phonetically sound spoken-word representations, making this work the first of its kind.
CUE Vectors: Modular Training of Language Models Conditioned on Diverse Contextual Signals
- Computer ScienceFINDINGS
- 2022
A modular framework that allows incremental, scalable training of context-enhanced LMs and can swap one type of pretrained sentence LM for another without retraining the context encoders, adapting only the decoder model.
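The modularity described here follows from keeping the decoder's interface to the context fixed. A toy sketch of that idea, assuming hypothetical per-signal encoders and feature sizes (this is not the paper's architecture, only an illustration of the decoupling):

```python
import torch
import torch.nn as nn

CUE_DIM = 128  # fixed-size context embedding the decoder consumes

# Each contextual signal gets its own encoder emitting a CUE_DIM vector,
# so encoders can be trained or added independently of the decoder.
title_encoder = nn.Sequential(nn.Linear(300, CUE_DIM), nn.Tanh())  # title features
date_encoder = nn.Sequential(nn.Linear(8, CUE_DIM), nn.Tanh())     # date features

def cue_vector(title_feat, date_feat):
    # Summation keeps the decoder interface fixed when signals change.
    return title_encoder(title_feat) + date_encoder(date_feat)

# Any sentence LM conditioned on a CUE_DIM vector can serve as the decoder;
# swapping it requires adapting only the decoder, since the encoders'
# output interface is unchanged.
cue = cue_vector(torch.randn(1, 300), torch.randn(1, 8))
```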
AVATAR: Unconstrained Audiovisual Speech Recognition
- Computer ScienceINTERSPEECH
- 2022
A new sequence-to-sequence AudioVisual ASR TrAnsformeR (AVATAR) is proposed, trained end-to-end from spectrograms and full-frame RGB, which demonstrates the contribution of the visual modality on the How2 AV-ASR benchmark and outperforms all prior work by a large margin.
Looking Enhances Listening: Recovering Missing Speech Using Images
- Computer ScienceICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2020
This paper shows that end-to-end multimodal ASR systems can become more robust to noise by leveraging the visual context, and observes that integrating visual context can result in up to 35% relative improvement in masked word recovery.
Analyzing Utility of Visual Context in Multimodal Speech Recognition Under Noisy Conditions
- Computer ScienceArXiv
- 2019
This work examines the utility of auxiliary visual context in multimodal automatic speech recognition under adversarial settings, and shows that current methods of integrating the visual modality do not improve model robustness to noise, so better visually grounded adaptation techniques are needed.
Eyes and Ears Together: New Task for Multimodal Spoken Content Analysis
- Computer ScienceMediaEval
- 2018
Eyes and Ears Together proposes two benchmark multimodal speech processing tasks: (1) multimodal automatic speech recognition (ASR) and (2) multi-reference co-reference resolution on spoken multimedia.
References
Showing 1–10 of 27 references
Large scale deep neural network acoustic modeling with semi-supervised training data for YouTube video transcription
- Computer Science2013 IEEE Workshop on Automatic Speech Recognition and Understanding
- 2013
Recent improvements to the original YouTube automatic closed-captioning system are described, in particular the use of owner-uploaded video transcripts to generate additional semi-supervised training data, and deep neural network acoustic models with large state inventories.
End-to-end Multimodal Speech Recognition
- Computer Science2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2018
This paper analyzes the behavior of CTC and S2S models on noisy video data (How-To corpus), and compares it to results on the clean Wall Street Journal corpus, providing insight into the robustness of both approaches.
Knowing When to Look: Adaptive Attention via a Visual Sentinel for Image Captioning
- Computer Science2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
- 2017
This paper proposes a novel adaptive attention model with a visual sentinel that sets the new state-of-the-art by a significant margin on image captioning.
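The sentinel mechanism can be viewed as attention over the spatial image features plus one extra slot for a "visual sentinel" vector distilled from the LSTM memory; the weight on that slot is the gate that lets the decoder fall back on its language prior instead of the image. A schematic rendering, with projection matrices and dimensions assumed for illustration:

```python
import torch

def adaptive_context(spatial_feats, sentinel, query, wv, wq, wh):
    """Visual-sentinel mixing in the spirit of Lu et al. (2017).
    spatial_feats: (B, K, D) image regions; sentinel: (B, D); query: (B, D)
    current hidden state. wv, wq: (D, A) and wh: (A, 1) are illustrative
    projection matrices."""
    # scores over K regions plus one extra slot for the sentinel
    feats = torch.cat([spatial_feats, sentinel.unsqueeze(1)], dim=1)  # (B, K+1, D)
    z = torch.tanh(feats @ wv + (query @ wq).unsqueeze(1)) @ wh       # (B, K+1, 1)
    alpha = torch.softmax(z.squeeze(-1), dim=1)                       # (B, K+1)
    # the sentinel's weight acts as the gate: weight 1 means "don't look"
    ctx = (alpha.unsqueeze(-1) * feats).sum(dim=1)                    # (B, D)
    return ctx, alpha[:, -1]                                          # context, gate

B, K, D, A = 2, 49, 512, 256
ctx, beta = adaptive_context(
    torch.randn(B, K, D), torch.randn(B, D), torch.randn(B, D),
    torch.randn(D, A), torch.randn(D, A), torch.randn(A, 1))
```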
Spoken Content Retrieval—Beyond Cascading Speech Recognition with Text Retrieval
- Computer ScienceIEEE/ACM Transactions on Audio, Speech, and Language Processing
- 2015
This article provides a thorough overview of the concepts, principles, approaches, and achievements of major technical contributions along this line of investigation.
Towards Universal Paraphrastic Sentence Embeddings
- Computer ScienceICLR
- 2016
This work considers the problem of learning general-purpose, paraphrastic sentence embeddings based on supervision from the Paraphrase Database, and compares six compositional architectures, finding that the most complex architectures, such as long short-term memory (LSTM) recurrent neural networks, perform best on the in-domain data.
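The simplest compositional architecture in that comparison is plain word averaging, which contrasts with the LSTM models noted above as strongest in-domain. A minimal sketch of the averaging baseline, with random stand-in vectors and a toy vocabulary (both assumptions for illustration):

```python
import numpy as np

# Toy vocabulary and random stand-in word vectors (real systems would use
# trained paraphrastic embeddings).
vocab = {"the": 0, "cat": 1, "sat": 2}
word_vecs = np.random.randn(len(vocab), 300)

def sentence_embedding(sentence):
    """Average the word vectors of in-vocabulary tokens."""
    ids = [vocab[w] for w in sentence.lower().split() if w in vocab]
    return word_vecs[ids].mean(axis=0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

sim = cosine(sentence_embedding("the cat sat"), sentence_embedding("the cat"))
```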
Spoken Content Retrieval: A Survey of Techniques and Technologies
- Computer ScienceFound. Trends Inf. Retr.
- 2012
This survey provides an overview of the field ofSCR encompassing component technologies, the relationship of SCR to text IR and automatic speech recognition and user interaction issues, and is aimed at researchers with backgrounds in speech technology or IR who are seeking deeper insight on how these fields are integrated to support research and development.
The MGB challenge: Evaluating multi-genre broadcast media recognition
- Computer Science2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU)
- 2015
An evaluation focused on speech recognition, speaker diarization, and "lightly supervised" alignment of BBC TV recordings at ASRU 2015 is described, and the results obtained are summarized.
Recurrent neural network based language model
- Computer ScienceINTERSPEECH
- 2010
Results indicate that it is possible to obtain around a 50% reduction in perplexity by using a mixture of several RNN LMs, compared to a state-of-the-art backoff language model.
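The mixture and the perplexity metric are both simple to state. The sketch below interpolates per-token probabilities from several models and computes corpus perplexity; the probability values are placeholders standing in for real RNN LM outputs.

```python
import math

def mixture_prob(word_probs, weights):
    """Interpolate per-model P(w | history); weights sum to 1."""
    return sum(l * p for l, p in zip(weights, word_probs))

def perplexity(token_probs):
    """token_probs: mixture P(w_i | history) for each token in the corpus."""
    n = len(token_probs)
    return math.exp(-sum(math.log(p) for p in token_probs) / n)

# Three hypothetical RNN LMs scoring a 4-token sentence:
per_model = [
    [0.10, 0.20, 0.05, 0.30],  # model 1
    [0.12, 0.18, 0.07, 0.25],  # model 2
    [0.08, 0.22, 0.06, 0.28],  # model 3
]
weights = [0.4, 0.3, 0.3]
probs = [mixture_prob(col, weights) for col in zip(*per_model)]
print(f"perplexity = {perplexity(probs):.2f}")
```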
Look, listen, and decode: Multimodal speech recognition with images
- Computer Science2016 IEEE Spoken Language Technology Workshop (SLT)
- 2016
A lattice rescoring algorithm is investigated that integrates information from the image at two different points: the image is used to augment the language model with the most likely words, and to rescore the top hypotheses using a word-level RNN.
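The second integration point, rescoring top hypotheses with an image-conditioned LM, is essentially n-best re-ranking with interpolated scores. A minimal sketch under stated assumptions: `lm_score_with_image` stands in for a word-level RNN conditioned on the image, and `lam` is an interpolation weight tuned on held-out data.

```python
import torch

def rescore(nbest, image_feat, lm_score_with_image, lam=0.5):
    """Pick the hypothesis maximizing recognizer score + weighted LM score.
    nbest: list of (hypothesis, recognizer_score) pairs."""
    rescored = [
        (hyp, am_score + lam * lm_score_with_image(hyp, image_feat))
        for hyp, am_score in nbest
    ]
    return max(rescored, key=lambda x: x[1])[0]

# toy stand-in LM: prefers shorter hypotheses (a real model would be a
# word-level RNN conditioned on the image feature)
fake_lm = lambda hyp, img: -0.5 * len(hyp.split())
nbest = [("the cat sat on the mat", -12.3), ("the cats at on the mat", -12.1)]
best = rescore(nbest, torch.randn(2048), fake_lm)
```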