Multimodal Grounding for Sequence-to-Sequence Speech Recognition

Ozan Caglayan, Ramon Sanabria, Shruti Palaskar, Loic Barrault, Florian Metze
Humans are capable of processing speech by making use of multiple sensory modalities. For example, the environment where a conversation takes place generally provides semantic and/or acoustic context that helps us to resolve ambiguities or to recall named entities. Motivated by this, there have been many works studying the integration of visual information into the speech recognition pipeline. Specifically, in our previous work, we propose a multistep visual adaptive training approach which…
