Noise Estimation Using Density Estimation for Self-Supervised Multimodal Learning

Elad Amrani, Rami Ben-Ari, Daniel Rotman, Alexander M. Bronstein
One of the key factors in enabling machine learning models to comprehend and solve real-world tasks is leveraging multimodal data. Unfortunately, annotating multimodal data is challenging and expensive. Recently, self-supervised multimodal methods that combine vision and language were proposed to learn multimodal representations without annotation. However, these methods often ignore the presence of high levels of noise and thus yield sub-optimal results. In this work, we show…

Multimodal Clustering Networks for Self-supervised Learning from Unlabeled Videos

Proposes a self-supervised training framework that learns a common multimodal embedding space enforcing a grouping of semantically similar instances, enabling retrieval of samples across all modalities, even from unseen datasets and different domains.

COBRA: Contrastive Bi-Modal Representation Algorithm

Presents COBRA, a novel framework that trains two modalities jointly, inspired by the Contrastive Predictive Coding and Noise Contrastive Estimation paradigms; it preserves both inter- and intra-class relationships and significantly reduces the modality gap.
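As a rough illustration of the Noise Contrastive Estimation paradigm this entry refers to, the following is a generic single-positive InfoNCE sketch in NumPy; it is not COBRA's exact objective, and all names in it are illustrative:

```python
import numpy as np

def info_nce(anchor, positive, negatives, temperature=0.1):
    """Generic InfoNCE-style contrastive loss for a single anchor
    embedding: score the positive pair against a set of negatives
    and apply softmax cross-entropy with the positive as target."""
    logits = np.concatenate(([anchor @ positive], negatives @ anchor))
    logits = logits / temperature
    logits = logits - logits.max()          # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[0])                # positive sits at index 0
```

Minimizing this loss pulls the anchor toward its positive and pushes it away from the negatives, which is the basic mechanism behind contrastive multimodal representation learning.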

Just Ask: Learning to Answer Questions from Millions of Narrated Videos

This work proposes to avoid manual annotation by generating a large-scale training dataset for video question answering using automatic cross-modal supervision, and introduces iVQA, a new VideoQA dataset with reduced language biases and high-quality redundant manual annotations.

Support-set bottlenecks for video-text representation learning

This paper proposes a novel method that leverages a generative model to naturally push related samples together, and results in representations that explicitly encode semantics shared between samples, unlike noise contrastive learning.

HiT: Hierarchical Transformer with Momentum Contrast for Video-Text Retrieval

Proposes Hierarchical Transformer (HiT), inspired by MoCo, which performs Hierarchical Cross-modal Contrastive Matching at both the feature level and the semantic level, achieving multi-view and comprehensive retrieval results.

Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval

An end-to-end trainable model that is designed to take advantage of both large-scale image and video captioning datasets and yields state-of-the-art results on standard downstream video-retrieval benchmarks including MSR-VTT, MSVD, DiDeMo and LSMDC.

Multilingual Multimodal Pre-training for Zero-Shot Cross-Lingual Transfer of Vision-Language Models

This paper focuses on multilingual text-to-video search and proposes a Transformer-based model that learns contextual multilingual multimodal embeddings and significantly improves video search in non-English languages without additional annotations.

AVLnet: Learning Audio-Visual Language Representations from Instructional Videos

This work introduces Audio-Video Language Network (AVLnet), a self-supervised network that learns a shared audio-visual embedding space directly from raw video inputs, and performs analysis of AVLnet's learned representations, showing the model has learned to relate visual objects with salient words and natural sounds.

Look Before you Speak: Visually Contextualized Utterances

Introduces a new visually conditioned Future Utterance Prediction task, predicting the next utterance in a video using both visual frames and transcribed speech as context, and a novel co-attentional multimodal video transformer that outperforms baselines using textual inputs alone.

CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval

Learning Representations for Multimodal Data with Deep Belief Nets

The experimental results on bi-modal data consisting of images and text show that the Multimodal DBN can learn a good generative model of the joint space of image and text inputs that is useful for filling in missing data, so it can be used both for image annotation and image retrieval.

Learning Joint Embedding with Multimodal Cues for Cross-Modal Video-Text Retrieval

This paper proposes a novel framework that simultaneously utilizes multi-modal features (different visual characteristics, audio inputs, and text) by a fusion strategy for efficient retrieval and explores several loss functions in training the embedding.

Learning to Learn From Noisy Labeled Data

This work proposes a noise-tolerant training algorithm, where a meta-learning update is performed prior to the conventional gradient update, and trains the model such that after one gradient update using each set of synthetic noisy labels, the model does not overfit to the specific noise.

Shuffle and Learn: Unsupervised Learning Using Temporal Order Verification

This paper presents an approach for learning a visual representation from the raw spatiotemporal signals in videos using a Convolutional Neural Network, and shows that this method captures temporally varying information, such as human pose.

Learning a Text-Video Embedding from Incomplete and Heterogeneous Data

This work proposes a Mixture-of-Embedding-Experts (MEE) model with the ability to handle missing input modalities during training, and demonstrates significant improvements over previously reported methods on both text-to-video and video-to-text retrieval tasks.

Learning from Noisy Labels with Distillation

This work proposes a unified distillation framework to use “side” information, including a small clean dataset and label relations in knowledge graph, to “hedge the risk” of learning from noisy labels, and proposes a suite of new benchmark datasets to evaluate this task in Sports, Species and Artifacts domains.

Context Encoders: Feature Learning by Inpainting

It is found that a context encoder learns a representation that captures not just appearance but also the semantics of visual structures, and can be used for semantic inpainting tasks, either stand-alone or as initialization for non-parametric methods.

CleanNet: Transfer Learning for Scalable Image Classifier Training with Label Noise

Introduces CleanNet, a joint neural embedding network that requires only a fraction of the classes to be manually verified to provide knowledge of label noise transferable to other classes, reducing the label-noise detection error rate on held-out classes where no human supervision is available.

VideoBERT: A Joint Model for Video and Language Representation Learning

This work builds upon the BERT model to learn bidirectional joint distributions over sequences of visual and linguistic tokens, derived from vector quantization of video data and off-the-shelf speech recognition outputs, respectively, which can be applied directly to open-vocabulary classification.

End-to-End Learning of Visual Representations From Uncurated Instructional Videos

This work proposes a new learning approach, MIL-NCE, capable of addressing misalignments inherent in narrated videos, and outperforms all published self-supervised approaches for these tasks as well as several fully supervised baselines.
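The multiple-instance flavor of NCE named in this entry can be sketched as follows; this is a minimal NumPy illustration of the core idea (a bag of candidate positive captions per clip, pooled inside the loss), not the paper's full training objective:

```python
import numpy as np

def mil_nce_loss(clip_emb, candidate_pos, negatives):
    """MIL-NCE-style loss for one video clip embedding.
    `candidate_pos` is a bag of caption embeddings, any of which may
    truly align with the clip (e.g. temporally nearby narrations);
    the bag's scores are summed, so one good candidate suffices."""
    pos = np.exp(candidate_pos @ clip_emb).sum()   # pooled positive scores
    neg = np.exp(negatives @ clip_emb).sum()       # negative scores
    return -np.log(pos / (pos + neg))
```

Summing over the bag of candidates keeps the loss low as long as at least one candidate caption matches the clip, which is how this family of objectives tolerates misaligned narrations.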