GEM: A General Evaluation Benchmark for Multimodal Tasks

Lin Su, Nan Duan, Edward Cui, Lei Ji, Chenfei Wu, Huaishao Luo, Yongfei Liu, Ming Zhong, Taroon Bharti, Arun Sacheti
In this paper, we present GEM as a General Evaluation benchmark for Multimodal tasks. Different from existing datasets such as GLUE (Wang et al., 2018), SuperGLUE (Wang et al., 2019), XGLUE (Liang et al., 2020) and XTREME (Hu et al., 2020) that mainly focus on natural language tasks, GEM is a large-scale vision-language benchmark, which consists of GEM-I for image-language tasks and GEM-V for video-language tasks. Compared with existing multimodal datasets such as MSCOCO (Chen et al., 2015… 


IGLUE: A Benchmark for Transfer Learning across Modalities, Tasks, and Languages
The Image-Grounded Language Understanding Evaluation benchmark enables the evaluation of multilingual multimodal models for transfer learning, not only in a zero-shot setting, but also in newly defined few-shot learning setups.
xGQA: Cross-Lingual Visual Question Answering
This work provides xGQA, a new multilingual evaluation benchmark for the visual question answering task, and extends the established English GQA dataset to 7 typologically diverse languages, enabling the detection and exploration of crucial challenges in cross-lingual visual question answering.
VLUE: A Multi-Task Benchmark for Evaluating Vision-Language Models
The VLUE benchmark is introduced, a multi-task multi-dimension benchmark for evaluating the generalization capabilities and the efficiency-performance trade-off (“Pareto SOTA”) of VLP models, and it is demonstrated that there is a sizable generalization gap for all VLP models when testing on out-of-distribution test sets annotated on images from a more diverse distribution that spreads across cultures.
Delving Deeper into Cross-lingual Visual Question Answering
This work tackles low transfer performance via novel methods that substantially reduce the gap to monolingual English performance, yielding +10 accuracy points over existing methods and conducts extensive analyses on modality biases in training data and models, aimed at understanding why zero-shot performance gaps remain for some question types and languages.
Automatic Medical Text Simplification: Challenges of Data Quality and Curation
It is proposed that careful crowd-sourcing for medical text simplification is possible when combined with automatic data labeling, a well-designed expert-layman collaboration framework, and context-dependent crowd-sourcing instructions.


Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training
After pretraining on large-scale image-caption pairs, Unicoder-VL is transferred to caption-based image-text retrieval and visual commonsense reasoning, with just one additional output layer, and shows the powerful ability of the cross-modal pre-training.
XGLUE: A New Benchmark Dataset for Cross-lingual Pre-training, Understanding and Generation
A recent cross-lingual pre-trained model Unicoder is extended to cover both understanding and generation tasks, which is evaluated on XGLUE as a strong baseline and the base versions of Multilingual BERT, XLM and XLM-R are evaluated for comparison.
UniViLM: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation
Experimental results demonstrate that the UniVL can learn strong video-text representation and achieves state-of-the-art results on five downstream tasks.
MSR-VTT: A Large Video Description Dataset for Bridging Video and Language
A detailed analysis of MSR-VTT in comparison to a complete set of existing datasets, together with a summarization of different state-of-the-art video-to-text approaches, shows that the hybrid Recurrent Neural Network-based approach, which combines single-frame and motion representations with a soft-attention pooling strategy, yields the best generalization capability on this dataset.
UNITER: Learning UNiversal Image-TExt Representations
UNITER, a UNiversal Image-TExt Representation, learned through large-scale pre-training over four image-text datasets is introduced, which can power heterogeneous downstream V+L tasks with joint multimodal embeddings.
GLGE: A New General Language Generation Evaluation Benchmark
The General Language Generation Evaluation (GLGE), a new multi-task benchmark for evaluating the generalization capabilities of NLG models across eight language generation tasks, is presented and a leaderboard with strong baselines including MASS, BART, and ProphetNet is built.
TVQA: Localized, Compositional Video Question Answering
This paper presents TVQA, a large-scale video QA dataset based on 6 popular TV shows, and provides analyses of this new dataset as well as several baselines and a multi-stream end-to-end trainable neural network framework for the TVQA task.
ActivityNet-QA: A Dataset for Understanding Complex Web Videos via Question Answering
This work introduces ActivityNet-QA, a fully annotated and large-scale VideoQA dataset, which consists of 58,000 QA pairs on 5,800 complex web videos derived from the popular ActivityNet dataset, and explores various video representation strategies to improve VideoQA performance.
XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalization
The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is introduced, a multi-task benchmark for evaluating the cross-lingual generalization capabilities of multilingual representations across 40 languages and 9 tasks.
Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks
This paper proposes a new learning method Oscar (Object-Semantics Aligned Pre-training), which uses object tags detected in images as anchor points to significantly ease the learning of alignments.