How2: A Large-scale Dataset for Multimodal Language Understanding

Ramon Sanabria, Ozan Caglayan, Shruti Palaskar, Desmond Elliott, Loic Barrault, Lucia Specia, Florian Metze

Human information processing is inherently multimodal, and language is best understood in a situated context. In order to achieve human-like language processing capabilities, machines should be able to jointly process multimodal data, and not just text, images, or speech in isolation. Nevertheless, there are very few multimodal datasets to support such research, resulting in limited interaction among different research communities. In this paper, we introduce How2, a large-scale dataset of…


