MLM: A Benchmark Dataset for Multitask Learning with Multiple Languages and Modalities

Jason Armitage, Endri Kacupaj, Golsa Tahmasebzadeh, Swati, M. Maleshkova, Ralph Ewerth, Jens Lehmann
Proceedings of the 29th ACM International Conference on Information & Knowledge Management

In this paper, we introduce the MLM (Multiple Languages and Modalities) dataset, a new resource to train and evaluate multitask systems on samples in multiple modalities and three languages. The generation process and the inclusion of semantic data provide a resource that further tests the ability of multitask systems to learn relationships between entities. The dataset is designed for researchers and developers who build applications that perform multiple tasks on data encountered on the web and…


Conversational Question Answering over Knowledge Graphs with Transformer and Graph Attention Networks

It is shown that LASAGNE improves the F1-score on eight out of ten question types, in some cases by more than 20% compared to the state of the art (SotA), and that it outperforms existing baselines averaged over all question types.

OEKG: The Open Event Knowledge Graph

This paper presents the Open Event Knowledge Graph (OEKG), a multilingual, event-centric, temporal knowledge graph composed of seven different data sets from multiple application domains, including question answering, entity recommendation and named entity recognition.

A Priority Map for Vision-and-Language Navigation with Trajectory Plans and Feature-Location Cues

A novel priority map module is integrated into a feature-location framework that doubles the task completion rates of standalone transformers and attains state-of-the-art performance for transformer-based systems on the Touchdown benchmark for VLN.

ArtELingo: A Million Emotion Annotations of WikiArt with Emphasis on Diversity over Language and Culture

This paper introduces ArtELingo, a new benchmark and dataset designed to encourage work on diversity across languages and cultures, following ArtEmis, a collection of 80k artworks from WikiArt with…

MM-Locate-News: Multimodal Focus Location Estimation in News

A novel dataset, Multimodal Focus Location of News (MM-Locate-News), is introduced; state-of-the-art methods are evaluated on the new benchmark, and novel models that predict the focus location of news from both textual and image content are proposed.

VOGUE: Answer Verbalization through Multi-Task Learning

The VOGUE framework attempts to generate a verbalized answer using a hybrid approach within a multitask learning paradigm, and it outperforms all current baselines on both the BLEU and METEOR evaluation metrics.

GeoWINE: Geolocation based Wiki, Image, News and Event Retrieval

The GeoWINE (Geolocation-based Wiki-Image-News-Event retrieval) demonstrator is presented, an effective modular system for multimodal retrieval which expects only a single image as input.

WIT: Wikipedia-based Image Text Dataset for Multimodal Multilingual Machine Learning

The Wikipedia-based Image Text (WIT) Dataset is introduced to better facilitate multimodal, multilingual learning and represents a more diverse set of concepts and real world entities relative to what previous datasets cover.

XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalization

The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is introduced, a multi-task benchmark for evaluating the cross-lingual generalization capabilities of multilingual representations across 40 languages and 9 tasks.

From Intra-Modal to Inter-Modal Space: Multi-task Learning of Shared Representations for Cross-Modal Retrieval

This work proposes a two-stage shared representation learning framework, with intra-modal optimization and subsequent cross-modal transfer learning of semantic structure, that produces a robust shared representation space.

A Multi-lingual Multi-task Architecture for Low-resource Sequence Labeling

A multi-lingual multi-task architecture is proposed to develop supervised sequence-labeling models from a minimal amount of labeled data; it proves particularly effective in low-resource settings with fewer than 200 training sentences for the target task.

VaTeX: A Large-Scale, High-Quality Multilingual Dataset for Video-and-Language Research

This work presents a new large-scale multilingual video description dataset, VATEX, which contains over 41,250 videos and 825,000 captions in both English and Chinese and demonstrates that the spatiotemporal video context can be effectively utilized to align source and target languages and thus assist machine translation.

Enhanced representation and multi-task learning for image annotation

Recipe1M+: A Dataset for Learning Cross-Modal Embeddings for Cooking Recipes and Food Images

It is demonstrated that regularization via the addition of a high-level classification objective both improves retrieval performance to rival that of humans and enables semantic vector arithmetic.

A Survey on Multi-Task Learning

A survey of MTL from the perspectives of algorithmic modeling, applications, and theoretical analyses, which gives a definition of MTL and classifies MTL algorithms into five categories: the feature learning, low-rank, task clustering, task relation learning, and decomposition approaches.

Multi-task Learning Using Multi-modal Encoder-Decoder Networks with Shared Skip Connections

Multi-modal encoder-decoder networks are proposed to harness the multi-modal nature of multi-task scene recognition; they efficiently learn a shared feature representation among all modalities in the training data.

Cross-modal Image-Text Retrieval with Multitask Learning

Two regularization terms (variance and consistency constraints) are introduced on the cross-modal embeddings so that the learned common information has large variance and is modality invariant; to enable large-scale cross-modal similarity search, a flexible binary transform network is designed to convert the text and image embeddings into binary codes.

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.