Does My Multimodal Model Learn Cross-modal Interactions? It’s Harder to Tell than You Might Think!

Jack Hessel and Lillian Lee
Modeling expressive cross-modal interactions seems crucial in multimodal tasks, such as visual question answering. However, sometimes high-performing black-box algorithms turn out to be mostly exploiting unimodal signals in the data. We propose a new diagnostic tool, empirical multimodally-additive function projection (EMAP), for isolating whether or not cross-modal interactions improve performance for a given model on a given task. This function projection modifies model predictions so that… 
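The additive projection described in the abstract can be sketched as follows. This is an illustrative reading of the EMAP idea, assuming model scores are available on the full cross-product of inputs from the two modalities; it is not the authors' released implementation, and the grid-based input format is an assumption.

```python
import numpy as np

def emap(pred_grid):
    """Project a grid of model predictions onto the nearest additive
    (no-interaction) function and return its values on the original pairs.

    pred_grid[i, j] is the model's score when the i-th input from one
    modality is paired with the j-th input from the other; the diagonal
    holds the scores for the actually observed pairs.
    """
    row_mean = pred_grid.mean(axis=1, keepdims=True)   # per-row (e.g. text) effect
    col_mean = pred_grid.mean(axis=0, keepdims=True)   # per-column (e.g. image) effect
    grand_mean = pred_grid.mean()                      # global offset
    additive = row_mean + col_mean - grand_mean        # least-squares additive fit
    return np.diag(additive)                           # projected scores for real pairs
```

If the projected predictions score about as well as the originals on the task, the model's apparent cross-modal interactions were not contributing to performance.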


Perceptual Score: What Data Modalities Does Your Model Perceive?

The perceptual score is introduced: a metric that assesses the degree to which a model relies on different subsets of the input features, i.e., on the individual modalities.
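One simple way to probe reliance on a modality is to shuffle that modality's inputs across the batch and measure the accuracy drop. The sketch below illustrates that general idea; the function name and signature are illustrative placeholders, not the paper's API.

```python
import numpy as np

def modality_reliance(predict, xs_a, xs_b, labels, rng, n_perm=10):
    """Accuracy drop when modality A's inputs are permuted across examples.

    `predict(xs_a, xs_b)` returns predicted labels for paired inputs.
    A score near zero suggests the model ignores modality A; a large
    score suggests heavy reliance on it.
    """
    base = np.mean(predict(xs_a, xs_b) == labels)      # accuracy on real pairs
    drops = []
    for _ in range(n_perm):
        perm = rng.permutation(len(xs_a))              # break A-B alignment
        drops.append(base - np.mean(predict(xs_a[perm], xs_b) == labels))
    return float(np.mean(drops))
```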

SHAPE: An Unified Approach to Evaluate the Contribution and Cooperation of Individual Modalities

The SHapley vAlue-based PErceptual (SHAPE) scores are proposed; they measure the marginal contribution of individual modalities and the degree of cooperation across modalities, in order to improve the understanding of how present multi-modal models operate on different modalities and to encourage more sophisticated methods of integrating multiple modalities.
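The idea of a Shapley value over modalities can be illustrated with a generic exact computation over all modality orderings. The `value` oracle (performance of the model using only a subset of modalities) and the names here are placeholders for illustration, not the paper's exact formulation.

```python
from itertools import permutations

def shapley_contributions(value, modalities):
    """Shapley value of each modality.

    `value(subset)` returns task performance when only that subset of
    modalities is available (e.g. the others zeroed out or masked).
    Each modality's score is its marginal gain averaged over all
    orderings in which modalities are added one at a time.
    """
    scores = {m: 0.0 for m in modalities}
    orders = list(permutations(modalities))
    for order in orders:
        used = frozenset()
        for m in order:
            scores[m] += value(used | {m}) - value(used)  # marginal gain of m
            used = used | {m}
    return {m: s / len(orders) for m, s in scores.items()}
```

By the efficiency property, the per-modality scores sum to the full model's performance minus the empty-input baseline.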

Can We Use Small Models to Investigate Multimodal Fusion Methods?

This work proposes studying multimodal fusion methods in a smaller setting, with small models and datasets, and finds that some results for fusion methods from the larger domain carry over to the arithmetic sandbox, indicating a promising avenue for multimodal prototyping.

DIME: Fine-grained Interpretations of Multimodal Models via Disentangled Local Explanations

DIME enables accurate and fine-grained analysis of multimodal models while maintaining generality across arbitrary modalities, model architectures, and tasks, and presents a step towards debugging and improving these models for real-world deployment.

MultiViz: An Analysis Benchmark for Visualizing and Understanding Multimodal Models

The complementary stages in MULTIVIZ together enable users to simulate model predictions, assign interpretable concepts to features, perform error analysis on model misclassifications, and use insights from error analysis to debug models.

MERLOT RESERVE: Neural Script Knowledge through Vision and Language and Sound

MERLOT Reserve is introduced, a model that represents videos jointly over time – through a new training objective that learns from audio, subtitles, and video frames, which enables out-of-the-box prediction, revealing strong multimodal commonsense understanding.

Modality-specific Learning Rates for Effective Multimodal Additive Late-fusion

A Modality-Specific Learning Rate (MSLR) method to effectively build late-fusion multimodal models from fine-tuned unimodal models is proposed, and experiments show that MSLR outperforms global learning rates on multiple tasks and settings, enabling the models to effectively learn each modality.

Do Androids Laugh at Electric Sheep? Humor "Understanding" Benchmarks from The New Yorker Caption Contest

This work challenges AI models to “demonstrate understanding” of the sophisticated multimodal humor of The New Yorker Caption Contest, and investigates vision-and-language models that take the cartoon pixels and caption directly as input, as well as language-only models for which the authors circumvent image processing by providing textual descriptions of the image.

Characterizing and overcoming the greedy nature of learning in multi-modal deep neural networks

An algorithm is proposed to balance the conditional learning speeds between modalities during training, and it is demonstrated to address the issue of greedy learning.

LXMERT: Learning Cross-Modality Encoder Representations from Transformers

The LXMERT (Learning Cross-Modality Encoder Representations from Transformers) framework, a large-scale Transformer model consisting of three encoders, achieves state-of-the-art results on two visual question answering datasets and shows the generalizability of the pre-trained cross-modality model.

A Co-Memory Network for Multimodal Sentiment Analysis

A novel co-memory network is proposed to iteratively model the interactions between visual contents and textual words for multimodal sentiment analysis, and demonstrates the effectiveness of the proposed model compared to the state-of-the-art methods.

A decision-theoretic generalization of on-line learning and an application to boosting

The model studied can be interpreted as a broad, abstract extension of the well-studied on-line prediction model to a general decision-theoretic setting, and it is shown that the multiplicative weight-update Littlestone-Warmuth rule can be adapted to this model, yielding bounds that are slightly weaker in some cases, but applicable to a considerably more general class of learning problems.

Categorizing and Inferring the Relationship between the Text and Image of Twitter Posts

It is shown that by combining the text and image information, a machine learning approach can be built that accurately distinguishes between the relationship types and can be used directly in end-user applications to optimize screen real estate.

MultiSentiNet: A Deep Semantic Network for Multimodal Sentiment Analysis

A deep semantic network, MultiSentiNet, is proposed for multimodal sentiment analysis; a visual-feature-guided attention LSTM model extracts words that are important for understanding the sentiment of the whole tweet and aggregates the representations of those informative words with visual semantic features (object and scene).
