Corpus ID: 247223128

M5Product: Self-harmonized Contrastive Learning for E-commercial Multi-modal Pretraining

Xiao Dong, Xunlin Zhan, Yangxin Wu, Yunchao Wei, Michael C. Kampffmeyer, Xiaoyong Wei, Minlong Lu, Yaowei Wang, Xiaodan Liang
Despite the potential of multi-modal pre-training to learn highly discriminative feature representations from complementary data modalities, current progress is being slowed by the lack of large-scale, modality-diverse datasets. By leveraging the natural suitability of E-commerce, where different modalities capture complementary semantic information, we contribute a large-scale multi-modal pretraining dataset, M5Product. The dataset comprises 5 modalities (image, text, table, video, and audio)…
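The paper's self-harmonized contrastive objective is not reproduced in this snippet, but the standard building block such cross-modal pretraining rests on is a symmetric InfoNCE loss over paired embeddings. A minimal sketch (the function name, numpy implementation, and temperature value are illustrative assumptions, not the authors' code):

```python
import numpy as np

def info_nce(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

    Matched pairs (row i of each matrix) are positives; every other
    pairing in the batch serves as a negative. Embeddings are
    L2-normalized so the dot product is a cosine similarity.
    """
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature          # (B, B) similarity matrix

    def xent(l):
        # cross-entropy with targets on the diagonal
        l = l - l.max(axis=1, keepdims=True)    # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # average image-to-text and text-to-image directions
    return 0.5 * (xent(logits) + xent(logits.T))
```

With more than two modalities, as in M5Product, a loss of this shape is typically applied over each pair of modalities and the terms are combined.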


M6: A Chinese Multimodal Pretrainer
In this work, we construct the largest dataset for multimodal pretraining in Chinese, which consists of over 1.9TB of images and 292GB of text covering a wide range of domains. We propose a cross-modal…
What Makes Multimodal Learning Better than Single (Provably)
This paper proves that learning with multiple modalities achieves a smaller population risk than learning with only a subset of those modalities, showing that multi-modal learning possesses an appealing formal guarantee.
Product1M: Towards Weakly Supervised Instance-Level Product Retrieval via Cross-Modal Pretraining
A novel model named Cross-modal contrAstive Product Transformer for instance-level prodUct REtrieval (CAPTURE) is proposed, which excels at capturing the potential synergy between multi-modal inputs via a hybrid-stream transformer in a self-supervised manner.
Multi-modal Transformer for Video Retrieval
A multi-modal transformer is presented that jointly encodes the different modalities in video, allowing each of them to attend to the others, together with a novel framework that establishes state-of-the-art results for video retrieval on three datasets.
ImageBERT: Cross-modal Pre-training with Large-scale Weak-supervised Image-Text Data
In this paper, we introduce a new vision-language pre-trained model -- ImageBERT -- for image-text joint embedding. Our model is a Transformer-based model, which takes different modalities as input…
Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts
The results clearly illustrate the benefit of scaling up pre-training data for vision-and-language tasks, as indicated by the new state-of-the-art results on both the nocaps and Conceptual Captions benchmarks.
TVQA: Localized, Compositional Video Question Answering
This paper presents TVQA, a large-scale video QA dataset based on 6 popular TV shows, and provides analyses of this new dataset as well as several baselines and a multi-stream end-to-end trainable neural network framework for the TVQA task.
Hero: Hierarchical Encoder for Video+Language Omni-representation Pre-training
HERO, a novel framework for large-scale video+language omni-representation learning, is presented; it achieves new state of the art on multiple benchmarks spanning text-based video/video-moment retrieval, video question answering (QA), video-and-language inference, and video captioning across different domains.
UNITER: UNiversal Image-TExt Representation Learning
UNITER, a UNiversal Image-TExt Representation, learned through large-scale pre-training over four image-text datasets is introduced, which can power heterogeneous downstream V+L tasks with joint multimodal embeddings.
Does My Multimodal Model Learn Cross-modal Interactions? It’s Harder to Tell than You Might Think!
A new diagnostic tool, empirical multimodally-additive function projection (EMAP), is introduced for isolating whether cross-modal interactions improve a given model's performance on a given task; the authors recommend that researchers in multimodal machine learning report not only the performance of unimodal baselines but also the EMAP of their best-performing model.
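The EMAP projection summarized above has a simple closed form once a model has been evaluated on all cross-pairings of a batch's two modalities: each example's prediction is replaced by its row mean plus its column mean minus the grand mean, which removes any non-additive cross-modal interaction. A minimal sketch of that projection step (the function name and array layout are assumptions, not the paper's code):

```python
import numpy as np

def emap_projection(scores):
    """Empirical multimodally-additive function projection (EMAP).

    scores[i, j] holds the model output f(text_i, image_j) evaluated on
    all N x N cross-pairings of a batch's modalities; the diagonal
    scores[i, i] are the model's actual predictions. Returns, for each
    example i, the additive (interaction-free) projection:

        f_proj(i) = mean_j f(t_i, v_j) + mean_j f(t_j, v_i)
                    - mean_{j,k} f(t_j, v_k)
    """
    row_mean = scores.mean(axis=1)   # average over image partners
    col_mean = scores.mean(axis=0)   # average over text partners
    grand_mean = scores.mean()
    return row_mean + col_mean - grand_mean
```

If the model is purely additive in its modalities, the projected scores equal the original diagonal predictions exactly; a gap between the two indicates that cross-modal interactions are contributing to performance.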