Corpus ID: 239998430

Perceptual Score: What Data Modalities Does Your Model Perceive?

Itai Gat, Idan Schwartz, Alexander G. Schwing
Machine learning advances in the last decade have relied significantly on large-scale datasets that continue to grow in size. Increasingly, those datasets also contain different data modalities. However, large multi-modal datasets are hard to annotate, and annotations may contain biases that we are often unaware of. Deep-net-based classifiers, in turn, are prone to exploit those biases and to find shortcuts. To study and quantify this concern, we introduce the perceptual score, a metric that…
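From the truncated abstract, the perceptual score appears to measure how much a model relies on a given modality by comparing its accuracy before and after that modality's inputs are permuted across the dataset. A minimal sketch of that idea, assuming a batch-permutation setup (the function names and averaging details are illustrative, not the paper's exact procedure):

```python
import numpy as np

rng = np.random.default_rng(0)

def accuracy(model, text, image, labels):
    """Fraction of correct predictions for paired (text, image) inputs."""
    return float(np.mean(model(text, image) == labels))

def perceptual_score(model, text, image, labels, modality="image", n_perm=10):
    """Sketch: accuracy drop when one modality is permuted across the
    batch (breaking its pairing with the labels), averaged over several
    random permutations. A large drop suggests the model genuinely
    relies on that modality; a drop near zero suggests it ignores it."""
    base = accuracy(model, text, image, labels)
    drops = []
    for _ in range(n_perm):
        idx = rng.permutation(len(labels))
        if modality == "image":
            permuted = accuracy(model, text, image[idx], labels)
        else:
            permuted = accuracy(model, text[idx], image, labels)
        drops.append(base - permuted)
    return float(np.mean(drops))
```

For a toy classifier that reads only the text modality, the image-permutation score is exactly zero while the text-permutation score is large, which is the shortcut-detection behavior the abstract describes.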


SHAPE: An Unified Approach to Evaluate the Contribution and Cooperation of Individual Modalities
This paper proposes SHAPE scores, which measure the marginal contribution of individual modalities and the degree of cooperation across modalities; these scores can improve our understanding of how present multi-modal models operate on different modalities and encourage more sophisticated methods of integrating them.
ZeroCap: Zero-Shot Image-to-Text Generation for Visual-Semantic Arithmetic
This work repurposes text-to-image matching models to generate a descriptive text for a given image at inference time, without any further training or tuning step, and demonstrates the ability to perform image arithmetic in which the inputs can be either images or text and the output is a sentence.
Latent Space Explanation by Intervention
This study reveals hidden concepts by employing an intervention mechanism, based on discrete variational autoencoders, that shifts the predicted class; determining which concepts can alter the class provides interpretability.
Characterizing and overcoming the greedy nature of learning in multi-modal deep neural networks
This work proposes an algorithm to balance the conditional learning speeds between modalities during training and demonstrates that it indeed addresses the issue of greedy learning.
Removing Bias in Multi-modal Classifiers: Regularization by Maximizing Functional Entropies
A novel regularization term based on the log-Sobolev inequality, which bounds the functional entropy by the functional Fisher information, is proposed; it maximizes the amount of information that the modalities contribute.
Does My Multimodal Model Learn Cross-modal Interactions? It’s Harder to Tell than You Might Think!
A new diagnostic tool, empirical multimodally-additive function projection (EMAP), isolates whether or not cross-modal interactions improve performance for a given model on a given task. The authors recommend that researchers in multimodal machine learning report not only the performance of unimodal baselines but also the EMAP of their best-performing model.
Large-scale Pretraining for Visual Dialog: A Simple State-of-the-Art Baseline
This work adapts the recently proposed ViLBERT model for multi-turn visually-grounded conversations and finds that additional finetuning using "dense" annotations in VisDial leads to even higher NDCG but hurts MRR, highlighting a trade-off between the two primary metrics.
Don’t Take the Easy Way Out: Ensemble Based Methods for Avoiding Known Dataset Biases
This paper trains a naive model that makes predictions exclusively based on dataset biases, and a robust model as part of an ensemble with the naive one in order to encourage it to focus on other patterns in the data that are more likely to generalize.
Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering
This work balances the popular VQA dataset by collecting complementary images such that every question in this balanced dataset is associated with not just a single image, but rather a pair of similar images that result in two different answers to the question.
RUBi: Reducing Unimodal Biases in Visual Question Answering
RUBi, a new learning strategy to reduce biases in any VQA model, is proposed, which reduces the importance of the most biased examples, i.e. examples that can be correctly classified without looking at the image.
Turning a Blind Eye: Explicit Removal of Biases and Variation from Deep Neural Network Embeddings
It is demonstrated on this dataset, for a number of facial attribute classification tasks, that the algorithm can be used to remove racial biases from the network feature representation.
Revisiting Visual Question Answering Baselines
The results suggest that a key problem of current VQA systems lies in the lack of visual grounding and localization of concepts that occur in the questions and answers, and a simple alternative model based on binary classification is developed.
Are VQA Systems RAD? Measuring Robustness to Augmented Data with Focused Interventions
A new robustness measure, Robustness to Augmented Data (RAD), is proposed, which measures the consistency of model predictions between original and augmented examples, and can quantify when state-of-the-art systems are not robust to counterfactuals.
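As described, RAD quantifies how consistent a model's predictions are between original and augmented examples. A plausible sketch (the paper's exact normalization may differ; this version conditions on examples answered correctly in their original form):

```python
def rad(pred_orig, pred_aug, labels_orig, labels_aug):
    """Sketch of a RAD-style consistency measure: of the examples the
    model answers correctly in their original form, the fraction whose
    augmented counterpart is also answered correctly. A value of 1.0
    means the model is fully robust to the augmentation."""
    correct_orig = [p == y for p, y in zip(pred_orig, labels_orig)]
    correct_aug = [p == y for p, y in zip(pred_aug, labels_aug)]
    both = sum(o and a for o, a in zip(correct_orig, correct_aug))
    n_orig = sum(correct_orig)
    return both / n_orig if n_orig else 0.0
```

Because the denominator counts only originally-correct examples, the measure isolates robustness from raw accuracy: a model can score high on the original test set yet reveal brittleness through a low RAD.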
Balanced Datasets Are Not Enough: Estimating and Mitigating Gender Bias in Deep Image Representations
It is shown that trained models significantly amplify the association of target labels with gender beyond what one would expect from biased datasets, and an adversarial approach is adopted to remove unwanted features corresponding to protected variables from intermediate representations in a deep neural network.