Constructing Multi-Modal Dialogue Dataset by Replacing Text with Semantically Relevant Images

Nyoungwoo Lee, Suwon Shin, Jaegul Choo, Ho-Jin Choi, Sung-Hyun Myaeng
In multi-modal dialogue systems, it is important to allow the use of images as part of a multi-turn conversation. Training such dialogue systems generally requires a large-scale dataset consisting of multi-turn dialogues that involve images, but such datasets rarely exist. In response, this paper proposes a 45k multimodal dialogue dataset created with minimal human intervention. Our method to create such a dataset consists of (1) preparing and pre-processing text dialogue datasets, (2) creating…


Multi-Modal Open-Domain Dialogue
This work studies incorporating different image fusion schemes and domain-adaptive pre-training and fine-tuning strategies, and shows that the best resulting model outperforms strong existing models in multi-modal dialogue while simultaneously performing as well as its predecessor (text-only) BlenderBot in text-based conversation.
Image-Grounded Conversations: Multimodal Context for Natural Question and Response Generation
This work presents a novel task, Image Grounded Conversations (IGC), in which natural-sounding conversations are generated about a shared image, and introduces a new multiple reference dataset of crowd-sourced, event-centric conversations on images.
DailyDialog: A Manually Labelled Multi-turn Dialogue Dataset
DailyDialog is a high-quality multi-turn dialog dataset that is intriguing in several respects: the language is human-written and less noisy, and the dialogues reflect everyday communication and cover various topics of daily life.
Towards Empathetic Open-domain Conversation Models: A New Benchmark and Dataset
This work proposes a new benchmark for empathetic dialogue generation and EmpatheticDialogues, a novel dataset of 25k conversations grounded in emotional situations, and presents empirical comparisons of dialogue model adaptations for empathetic responding, leveraging existing models or datasets without requiring lengthy re-training of the full model.
Visual Dialog
This work introduces a retrieval-based evaluation protocol for Visual Dialog, in which the AI agent is asked to sort a set of candidate answers and is evaluated on metrics such as the mean reciprocal rank of the human response, along with a family of neural encoder-decoder models that outperform a number of sophisticated baselines.
Generating Natural Questions About an Image
This paper introduces the novel task of Visual Question Generation, where the system is tasked with asking a natural and engaging question when shown an image, and provides three datasets which cover a variety of images from object-centric to event-centric.
Personalizing Dialogue Agents: I have a dog, do you have pets too?
This work collects data and trains models to condition on their given profile information and on information about the person they are talking to, resulting in improved dialogues as measured by next-utterance prediction.
UNITER: Learning UNiversal Image-TExt Representations
This work introduces UNITER, a UNiversal Image-TExt Representation learned through large-scale pre-training over four image-text datasets, which can power heterogeneous downstream V+L tasks with joint multimodal embeddings.
Visual Semantic Reasoning for Image-Text Matching
This work proposes a simple and interpretable reasoning model that generates a visual representation capturing the key objects and semantic concepts of a scene, and that outperforms the current best method for image retrieval and caption retrieval on the MS-COCO and Flickr30K datasets.
VL-BERT: Pre-training of Generic Visual-Linguistic Representations
This work introduces a new pre-trainable generic representation for visual-linguistic tasks, called Visual-Linguistic BERT (VL-BERT), which adopts the simple yet powerful Transformer model as its backbone and extends it to take both visual and linguistic embedded features as input.