Corpus ID: 232092534

MultiSubs: A Large-scale Multimodal and Multilingual Dataset

Josiah Wang, Pranava Swaroop Madhyastha, Josiel Maimoni de Figueiredo, Chiraag Lala, Lucia Specia

Published in: International Conference on Language Resources and Evaluation

This paper introduces a large-scale multimodal and multilingual dataset that aims to facilitate research on grounding words to images in their contextual usage in language. The dataset consists of images selected to unambiguously illustrate concepts expressed in sentences from movie subtitles. The dataset is a valuable resource as (i) the images are aligned to text fragments rather than whole sentences; (ii) multiple images are possible for a text fragment and a sentence; (iii) the sentences… 

xGQA: Cross-Lingual Visual Question Answering

This work provides xGQA, a new multilingual evaluation benchmark for the visual question answering task, and extends the established English GQA dataset to 7 typologically diverse languages, enabling us to detect and explore crucial challenges in cross-lingual visual question answering.

Vision Matters When It Should: Sanity Checking Multimodal Machine Translation Models

This paper presents a qualitative study that examines the role of datasets in encouraging models to use the visual modality, and proposes methods to highlight the importance of visual signals in the datasets; these methods improve the reliance of models on the source images.

Delving Deeper into Cross-lingual Visual Question Answering

This work tackles low transfer performance via novel methods that substantially reduce the gap to monolingual English performance, yielding +10 accuracy points over existing methods, and conducts extensive analyses of modality biases in training data and models, aimed at understanding why zero-shot performance gaps remain for some question types and languages.

Ten Years of BabelNet: A Survey

BabelNet is surveyed, a popular wide-coverage lexical-semantic knowledge resource obtained by merging heterogeneous sources into a unified semantic network that helps to scale tasks and applications to hundreds of languages.

A Multilingual Image-Text Fashion Dataset

The paper presents baselines for image-text classification showing that the GLAMI-1M dataset presents a challenging fine-grained classification problem: the best scoring EmbraceNet model using both visual and textual features achieves 69.7% accuracy.

GLAMI-1M: A Multilingual Image-Text Fashion Dataset

We introduce GLAMI-1M: the largest multilingual image-text classification dataset and benchmark. The dataset contains images of fashion products with item descriptions, each in 1 of 13 languages.

Multilingual Image Corpus – Towards a Multimodal and Multilingual Dataset

The goal of the project Multilingual Image Corpus (MIC 21) is to provide a large image dataset with annotated objects and object descriptions in 24 languages, designed both for image classification and object detection and for semantic segmentation.

IGLUE: A Benchmark for Transfer Learning across Modalities, Tasks, and Languages

The Image-Grounded Language Understanding Evaluation benchmark enables the evaluation of multilingual multimodal models for transfer learning, not only in a zero-shot setting, but also in new few-shot learning setups.



Learning Translations via Images with a Massively Multilingual Image Dataset

A novel method of predicting word concreteness from images is introduced, which improves on a previous state-of-the-art unsupervised technique and allows us to predict when image-based translation may be effective, enabling consistent improvements to a state-of-the-art text-based word translation system.

Multimodal Lexical Translation

A simple heuristic is introduced to quantify the extent of the ambiguity of a word from the distribution of its translations, and is used to select subsets of the MLT Dataset that are difficult to translate.

A Dataset and Reranking Method for Multimodal MT of User-Generated Image Captions

A retrieval-based method is introduced that pivots on similar images and uses the associated captions in the target language to rerank translation outputs; it is compatible with any machine translation system and allows new data to be integrated quickly without retraining the translation system.

Multi30K: Multilingual English-German Image Descriptions

This dataset extends the Flickr30K dataset with i) German translations created by professional translators over a subset of the English descriptions, and ii) descriptions crowdsourced independently of the original English descriptions.

Cross-Lingual Image Caption Generation

The model was designed to transfer the knowledge representation obtained from the English portion into the Japanese portion, and the resulting bilingual comparable corpus has better performance than a monolingual corpus, indicating that image understanding using a resource-rich language benefits a resource-poor language.

Show and tell: A neural image caption generator

This paper presents a generative model based on a deep recurrent architecture that combines recent advances in computer vision and machine translation and that can be used to generate natural sentences describing an image.

A Corpus of Images and Text in Online News

The ION corpus contains 300K news articles published between August 2014 - 2015 in five online newspapers from two countries, anticipating their use in computer vision tasks.

Unsupervised Visual Sense Disambiguation for Verbs using Multimodal Embeddings

An unsupervised algorithm based on Lesk is proposed, which performs visual sense disambiguation using textual, visual, or multimodal embeddings, and VerSe, a new dataset that augments existing multimodal datasets (COCO and TUHOI) with sense labels, is introduced.

Findings of the Second Shared Task on Multimodal Machine Translation and Multilingual Image Description

The results from the second shared task on multimodal machine translation and multilingual image description show multi-modal systems improved, but text-only systems remain competitive.

Grounded Word Sense Translation

This paper considers grounded word sense translation, i.e. the task of correctly translating an ambiguous source word given the corresponding textual and visual context, and finds that grounding on the image is especially beneficial in weaker unidirectional recurrent translation models.