SIMMC 2.0: A Task-oriented Dialog Dataset for Immersive Multimodal Conversations

Satwik Kottur, Seungwhan Moon, Alborz Geramifard, Babak Damavandi
Conference on Empirical Methods in Natural Language Processing

Next-generation task-oriented dialog systems need to understand conversational contexts together with their perceived surroundings in order to help users effectively in real-world multimodal environments. Existing task-oriented dialog datasets aimed at virtual assistance fall short because they do not situate the dialog in the user’s multimodal context. To overcome this, we present a new dataset for Situated and Interactive Multimodal Conversations, SIMMC 2.0, which includes 11K task-oriented user<->assistant dialogs…

Navigating Connected Memories with a Task-oriented Dialog System

This work proposes dialogs for connected memories as a powerful tool to empower users to search their media collection through a multi-turn, interactive conversation, and collects a new task-oriented dialog dataset COMET, which contains 11.5k user → assistant dialogs (totalling 103k utterances), grounded in simulated personal memory graphs.

Multimodal Interactions Using Pretrained Unimodal Models for SIMMC 2.0

This paper first pretrains a multimodal model to understand the relationship between image and text, then finetunes the model for each task, achieving the 3rd best performance in subtask and a runner-up position in the generation part of subtask #4.

DVD: A Diagnostic Dataset for Multi-step Reasoning in Video Grounded Dialogue

This paper presents DVD, a Diagnostic Dataset for Video-grounded Dialogue, which is designed to contain minimal biases and has detailed annotations for the different types of reasoning over the spatio-temporal space of video.

Which One Are You Referring To? Multimodal Object Identification in Situated Dialogue

Three methods for interpreting multimodal inputs from conversational and situational contexts are explored on the largest situated dialogue dataset, SIMMC 2.1, and the best method, scene-dialogue alignment, improves performance by ~20% F1-score compared to the SIMMC 2.1 baselines.

Dialog Acts for Task Driven Embodied Agents

This work proposes a set of dialog acts for modelling such dialogs, annotates the TEACh dataset, which includes over 3,000 situated, task-oriented conversations, with these dialog acts, and demonstrates the use of the annotated dataset in training models that tag the dialog acts of a given utterance.

Tell Your Story: Task-Oriented Dialogs for Interactive Content Creation

This work proposes task-oriented dialogs for montage creation as a novel interactive tool to seamlessly search, compile, and edit montages from a media collection, and collects a new dataset C3, which contains 10k dialogs conditioned on media montages simulated from a large media collection.

SPRING: Situated Conversation Agent Pretrained with Multimodal Questions from Incremental Layout Graph

A Situated Conversation Agent Pretrained with Multimodal Questions from INcremental Layout Graph (SPRING) is proposed, with the ability to reason over multi-hop spatial relations and connect them with visual attributes in crowded situated scenarios; it significantly outperforms state-of-the-art approaches on both the SIMMC 1.0 and SIMMC 2.0 datasets.

UNITER-Based Situated Coreference Resolution with Rich Multimodal Input

Results show that the proposed approach outperforms the official DSTC10 baseline substantially, with the object F1 score boosted from 36.6% to 77.3% on the development set, demonstrating the effectiveness of the proposed object representations from rich multimodal input.

Multimodal Conversational AI: A Survey of Datasets and Approaches

This paper motivates, defines, and mathematically formulates the multimodal conversational research objective, and provides a taxonomy of research required to solve the objective: multi-modality representation, fusion, alignment, translation, and co-learning.

Building Goal-Oriented Dialogue Systems with Situated Visual Context

A novel multimodal conversational framework is proposed in which the dialogue agent's next action and its arguments are derived jointly, conditioned on both the conversational and the visual context.



Situated and Interactive Multimodal Conversations

Situated Interactive MultiModal Conversations (SIMMC) is introduced as a new direction aimed at training agents that take multimodal actions grounded in a co-evolving multimodal input context in addition to the dialog history.

Joint Generation and Bi-Encoder for Situated Interactive MultiModal Conversations

An end-to-end encoder-decoder model based on BART for generating outputs of action prediction, response generation, and dialogue state tracking tasks in a single string, and another model based on bi-encoders for the response retrieval task that significantly outperformed the other entries of the challenge’s official evaluation.

Towards Scalable Multi-domain Conversational Agents: The Schema-Guided Dialogue Dataset

This work introduces the Schema-Guided Dialogue (SGD) dataset, containing over 16k multi-domain conversations spanning 16 domains, and presents a schema-guided paradigm for task-oriented dialogue, in which predictions are made over a dynamic set of intents and slots provided as input.

Audio Visual Scene-Aware Dialog Track in DSTC8

A new challenge task and dataset for Audio Visual Scene-Aware Dialog (AVSD) in DSTC7, which was the first attempt to combine conversation and multimodal video description into a single end-to-end differentiable network to build scene-aware dialog systems, is proposed.

MultiWOZ 2.1: Multi-Domain Dialogue State Corrections and State Tracking Baselines

This work uses crowdsourced workers to fix the state annotations and utterances in the original version of the MultiWOZ data, hoping that this dataset resource will allow for more effective dialogue state tracking models to be built in the future.

Overview of the Ninth Dialog System Technology Challenge: DSTC9

The task definition, provided datasets, baselines, and evaluation set-up are described for each track, and the results of the submitted systems are summarized to highlight the overall trends of the state-of-the-art technologies for the tasks.

A Simple Language Model for Task-Oriented Dialogue

SimpleTOD is a simple approach to task-oriented dialogue that uses a single causal language model trained on all sub-tasks recast as a single sequence prediction problem, which allows it to fully leverage transfer learning from pre-trained, open-domain, causal language models such as GPT-2.

CLEVR-Dialog: A Diagnostic Dataset for Multi-Round Reasoning in Visual Dialog

This work develops CLEVR-Dialog, a large diagnostic dataset for studying multi-round reasoning in visual dialog, and constructs a dialog grammar that is grounded in the scene graphs of the images from the CLEVR dataset, resulting in a dataset where all aspects of the visual dialog are fully annotated.

Multi-Task Learning for Situated Multi-Domain End-to-End Dialogue Systems

This paper uses multitask learning techniques to train a GPT-2 based model on a more challenging dataset with multiple domains, multiple modalities, and more diversity in output formats, and achieves better performance on all sub-tasks, across domains, compared to task- and domain-specific models.

Multimodal Transformer Networks for End-to-End Video-Grounded Dialogue Systems

Multimodal Transformer Networks (MTN) are proposed to encode videos and incorporate information from different modalities, along with a training procedure that simulates token-level decoding to improve the quality of generated responses during inference.