Vision and Language Integration Meets Multimedia Fusion

Marie-Francine Moens, Katerina Pastra, Kate Saenko, Tinne Tuytelaars. IEEE MultiMedia.
Multimodal information fusion at both the signal and the semantics level is a core part of most multimedia applications, including indexing, retrieval, and summarization. Prototype systems have implemented early or late fusion of modality-specific processing results through various methodologies, including rule-based approaches, information-theoretic models, and machine learning. Vision and language are two of the predominant modalities that are fused, with a long history of results in TRECVid…
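The early/late distinction above can be sketched in a few lines. This is a minimal NumPy illustration, not any particular system's method: the random projections `W_vision`, `W_text`, and `W_joint` are stand-ins for learned modality-specific and joint classifiers.

```python
import numpy as np

# Hypothetical classifiers; real systems would learn these from data.
rng = np.random.default_rng(0)
W_vision = rng.normal(size=(4, 8))   # maps an 8-d visual feature to 4 class scores
W_text = rng.normal(size=(4, 8))     # maps an 8-d text feature to 4 class scores
W_joint = rng.normal(size=(4, 16))   # maps the concatenated 16-d feature

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def early_fusion(v, t):
    """Fuse at the signal level: concatenate features, then classify once."""
    return softmax(W_joint @ np.concatenate([v, t]))

def late_fusion(v, t, alpha=0.5):
    """Fuse at the semantics level: classify each modality, then mix the scores."""
    return alpha * softmax(W_vision @ v) + (1 - alpha) * softmax(W_text @ t)

v = rng.normal(size=8)   # stand-in visual feature
t = rng.normal(size=8)   # stand-in text feature
p_early = early_fusion(v, t)
p_late = late_fusion(v, t)
```

Both variants return a distribution over the same label set; they differ only in whether the modalities interact before or after classification.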

Multimodal Subspace Support Vector Data Description

Integrating Vision and Language for First-Impression Personality Analysis

An evaluation of the authors' proposed multimodal method, using a job candidate screening system that predicted five personality traits from a short video, demonstrates the method's effectiveness.

Online Data Organizer: Micro-Video Categorization by Structure-Guided Multimodal Dictionary Learning

A structure-guided multi-modal dictionary learning model is built to learn the concept-level micro-video representation by jointly considering their venue structure and modality relatedness and an online learning algorithm is developed to incrementally and efficiently strengthen this model.

Multimodal fusion for multimedia analysis: a survey

This survey aims at providing multimedia researchers with a state-of-the-art overview of fusion strategies, which are used for combining multiple modalities in order to accomplish various multimedia analysis tasks.

Vision-Language Integration in AI: A Reality Check

A taxonomy of vision-language integration prototypes is presented, derived from an extensive survey of such prototypes across a wide range of AI research areas; it uses a prototype's integration purpose as the guiding criterion for classification.

Translating Videos to Natural Language Using Deep Recurrent Neural Networks

This paper proposes to translate videos directly to sentences using a unified deep neural network with both convolutional and recurrent structure, to create sentence descriptions of open-domain videos with large vocabularies.

Ask Your Neurons: A Neural-Based Approach to Answering Questions about Images

We address a question answering task on real-world images that is set up as a Visual Turing Test. By combining the latest advances in image representation and natural language processing, we propose…

Ask, Attend and Answer: Exploring Question-Guided Spatial Attention for Visual Question Answering

The Spatial Memory Network, a novel spatial attention architecture that aligns words with image patches in the first hop, is proposed; it improves results over a strong deep baseline model that concatenates image and question features to predict the answer.

Show, Attend and Tell: Neural Image Caption Generation with Visual Attention

An attention-based model that automatically learns to describe the content of images is introduced; it can be trained deterministically using standard backpropagation techniques or stochastically by maximizing a variational lower bound.
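The deterministic variant mentioned above is "soft" attention: the context vector is a weighted average of region features, so the whole computation stays differentiable. A minimal NumPy sketch, with random vectors standing in for learned CNN region features and a decoder state:

```python
import numpy as np

def soft_attention(features, query):
    """Deterministic (soft) attention over image regions.

    Scores each region against the current query, normalizes with a
    softmax, and returns the attention weights plus the expected
    (weighted-average) region feature, i.e. the context vector.
    """
    scores = features @ query                 # one relevance score per region
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                  # softmax over regions
    context = weights @ features              # differentiable weighted average
    return weights, context

rng = np.random.default_rng(1)
features = rng.normal(size=(6, 5))   # 6 image regions, 5-d feature each
query = rng.normal(size=5)           # decoder state at one caption step
weights, context = soft_attention(features, query)
```

The stochastic ("hard") variant would instead sample one region from `weights`, which breaks differentiability and motivates the variational lower bound used for training.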

VQA: Visual Question Answering

We propose the task of free-form and open-ended Visual Question Answering (VQA). Given an image and a natural language question about the image, the task is to provide an accurate natural language answer.

Embodied Language Processing: A New Generation of Language Technology

This paper argues that embodied cognition dictates the development of a new generation of language processing tools that bridge the gap between the symbolic and the sensorimotor representation spaces, and describes the tasks and challenges such tools need to address.

COSMOROE: a cross-media relations framework for modelling multimedia dialectics

This article presents COSMOROE, a corpus-based framework for describing semantic interrelations between images, language, and body movements, and argues that by viewing such relations from a message-formation perspective rather than a communicative-goal one, one may develop a framework with descriptive power and computational applicability.

Acquiring Common Sense Spatial Knowledge through Implicit Spatial Templates

This work introduces the task of predicting spatial templates for two objects under a relationship, and presents two simple neural-based models that leverage annotated images and structured text to learn this task, demonstrating that spatial locations are to a large extent predictable from implicit spatial language.