Spatio-Temporal Attention Models for Grounded Video Captioning


Automatic video captioning is challenging due to the complex interactions in dynamic real scenes. A comprehensive system would ultimately localize and track the objects, actions and interactions present in a video and generate a description that relies on temporal localization in order to ground the visual concepts. However, most existing automatic video…
DOI: 10.1007/978-3-319-54190-7_7


