Vision and Language Integration Meets Multimedia Fusion

  title={Vision and Language Integration Meets Multimedia Fusion},
  author={Marie-Francine Moens and Katerina Pastra and Kate Saenko and Tinne Tuytelaars},
  journal={IEEE Multim.},
Multimodal information fusion at both the signal and semantics level is a core part of most multimedia applications, including indexing, retrieval, and summarization. Prototype systems have implemented early or late fusion of modality-specific processing results through various methodologies including rule-based approaches, informationtheoretic models, and machine learning.1 Vision and language are two of the predominant modalities that are fused, with a long history of results in TRECVid… 

Online Data Organizer: Micro-Video Categorization by Structure-Guided Multimodal Dictionary Learning

A structure-guided multi-modal dictionary learning model is built to learn the concept-level micro-video representation by jointly considering their venue structure and modality relatedness and an online learning algorithm is developed to incrementally and efficiently strengthen this model.

Multimodal Subspace Support Vector Data Description

Integrating Vision and Language for First-Impression Personality Analysis

An evaluation of the authors' proposed multimodal method using a job candidate screening system that predicted five personality traits from a short video demonstrates the methods effectiveness.



Multimodal fusion for multimedia analysis: a survey

This survey aims at providing multimedia researchers with a state-of-the-art overview of fusion strategies, which are used for combining multiple modalities in order to accomplish various multimedia

Vision-Language Integration in AI: A Reality Check

A taxonomy of vision-language integration prototypes is presented which resulted from an extensive survey of such prototypes across a wide range of AI research areas and which uses a prototype's integration purpose as the guiding criterion for classification.

Show and tell: A neural image caption generator

This paper presents a generative model based on a deep recurrent architecture that combines recent advances in computer vision and machine translation and that can be used to generate natural sentences describing an image.

Long-term recurrent convolutional networks for visual recognition and description

A novel recurrent convolutional architecture suitable for large-scale visual learning which is end-to-end trainable, and shows such models have distinct advantages over state-of-the-art models for recognition or generation which are separately defined and/or optimized.

Translating Videos to Natural Language Using Deep Recurrent Neural Networks

This paper proposes to translate videos directly to sentences using a unified deep neural network with both convolutional and recurrent structure, to create sentence descriptions of open-domain videos with large vocabularies.

Ask Your Neurons: A Neural-Based Approach to Answering Questions about Images

We address a question answering task on real-world images that is set up as a Visual Turing Test. By combining latest advances in image representation and natural language processing, we propose

Ask, Attend and Answer: Exploring Question-Guided Spatial Attention for Visual Question Answering

The Spatial Memory Network, a novel spatial attention architecture that aligns words with image patches in the first hop, is proposed and improved results are obtained compared to a strong deep baseline model which concatenates image and question features to predict the answer.

Show, Attend and Tell: Neural Image Caption Generation with Visual Attention

An attention based model that automatically learns to describe the content of images is introduced that can be trained in a deterministic manner using standard backpropagation techniques and stochastically by maximizing a variational lower bound.

VQA: Visual Question Answering

We propose the task of free-form and open-ended Visual Question Answering (VQA). Given an image and a natural language question about the image, the task is to provide an accurate natural language

Sequence to Sequence -- Video to Text

A novel end- to-end sequence-to-sequence model to generate captions for videos that naturally is able to learn the temporal structure of the sequence of frames as well as the sequence model of the generated sentences, i.e. a language model.