
Delving Deeper into Cross-lingual Visual Question Answering

Chen Cecilia Liu, Jonas Pfeiffer, Anna Korhonen, Ivan Vulić, Iryna Gurevych
Visual question answering (VQA) is a crucial vision-and-language task. Yet, existing VQA research has mostly focused on the English language, due to a lack of suitable evaluation resources. Previous work on cross-lingual VQA has reported poor zero-shot transfer performance of current multilingual multimodal Transformers with large gaps to monolingual performance, without any deeper analysis. In this work, we delve deeper into the different aspects of cross-lingual VQA, aiming to…

xGQA: Cross-Lingual Visual Question Answering

Recent advances in multimodal vision and language modeling have predominantly focused on the English language, mostly due to the lack of multilingual multimodal datasets to steer modeling efforts.

Improving the Cross-Lingual Generalisation in Visual Question Answering

This work introduces a linguistic prior objective to augment the cross-entropy loss with a similarity-based loss to guide the model during training, and learns a task-specific subnetwork that improves cross-lingual generalisation and reduces variance without model modification.

cViL: Cross-Lingual Training of Vision-Language Models using Knowledge Distillation

A novel approach to knowledge distillation is proposed to train the model in other languages using parallel sentences; it leverages an existing English model to transfer knowledge to the target language using significantly fewer resources.

VLSP 2022 - EVJVQA Challenge: Multilingual Visual Question Answering

Details of the organization of the challenge, an overview of the methods employed by shared-task participants, and the results are presented.

Curriculum Script Distillation for Multilingual Visual Question Answering

Experimental results demonstrate that script plays a vital role in the performance of pre-trained models: target languages that share a script with the source language perform better than other languages, and mixed-script code-switched languages perform better than their counterparts.

VLSP2022-EVJVQA Challenge: Multilingual Visual Question Answering

A benchmark dataset for evaluating multilingual VQA systems or models is provided, including 33,000+ question–answer pairs in three languages (Vietnamese, English, and Japanese) over approximately 5,000 images taken in Vietnam.

Modular Deep Learning

A survey of modular architectures is offered, providing a unified view over several threads of research that evolved independently in the scientific literature, and various additional purposes of modularity are explored, including scaling language models, causal inference, programme induction, and planning in reinforcement learning.

Towards Multi-Lingual Visual Question Answering

This paper proposes a translation-based framework for mVQA data generation that requires much less human annotation effort than the conventional approach of directly collecting questions and answers, and applies this framework to the multilingual captions in the Crossmodal-3600 dataset.

IGLUE: A Benchmark for Transfer Learning across Modalities, Tasks, and Languages

The Image-Grounded Language Understanding Evaluation benchmark enables the evaluation of multilingual multimodal models for transfer learning, not only in a zero-shot setting, but also in newly defined few-shot learning setups.

Unifying Vision-and-Language Tasks via Text Generation

This work proposes a unified framework that learns different tasks in a single architecture with the same language modeling objective, i.e., multimodal conditional text generation, where the models learn to generate labels in text based on the visual and textual inputs.

UC2: Universal Cross-lingual Cross-modal Vision-and-Language Pre-training

UC2, the first machine translation-augmented framework for cross-lingual cross-modal representation learning, is introduced, to tackle the scarcity problem of multilingual captions for image datasets and facilitate the learning of a joint embedding space of images and all languages of interest.

From Zero to Hero: On the Limitations of Zero-Shot Language Transfer with Multilingual Transformers

It is demonstrated that the inexpensive few-shot transfer (i.e., additional fine-tuning on a few target-language instances) is surprisingly effective across the board, warranting more research efforts reaching beyond the limiting zero-shot conditions.

XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalization

The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is introduced: a multi-task benchmark for evaluating the cross-lingual generalization capabilities of multilingual representations across 40 languages and 9 tasks.

Don’t Stop Fine-Tuning: On Training Regimes for Few-Shot Cross-Lingual Transfer with Multilingual Language Models

This work presents a systematic study of a spectrum of FS-XLT fine-tuning regimes, analyzing key properties such as effectiveness, (in)stability, and modularity, and proposes replacing sequential fine-tuning with joint fine-tuning on source- and target-language instances, offering consistent gains with different numbers of shots.

Don't Just Assume; Look and Answer: Overcoming Priors for Visual Question Answering

GVQA explicitly disentangles the recognition of visual concepts present in the image from the identification of plausible answer space for a given question, enabling the model to more robustly generalize across different distributions of answers.

Are You Talking to a Machine? Dataset and Methods for Multilingual Image Question

The mQA model, which is able to answer questions about the content of an image, is presented, which contains four components: a Long Short-Term Memory (LSTM), a Convolutional Neural Network (CNN), an LSTM for storing the linguistic context in an answer, and a fusing component to combine the information from the first three components and generate the answer.