Vision and Language Integration Meets Multimedia Fusion

Multimodal information fusion at both the signal and semantic levels is a core part of most multimedia applications, including indexing, retrieval, and summarization. Prototype systems have implemented early or late fusion of modality-specific processing results through various methodologies, including rule-based approaches, information-theoretic models, and machine learning [1]. Vision and language are two of the predominant modalities that are fused, with a long history of results in TRECVid…

