A Linguistic Analysis of Visually Grounded Dialogues Based on Spatial Expressions

Takuma Udagawa, Takato Yamazaki, Akiko Aizawa
Recent models achieve promising results in visually grounded dialogues. However, existing datasets often contain undesirable biases and lack sophisticated linguistic analyses, which makes it difficult to understand how well current models recognize their precise linguistic structures. To address this problem, we make two design choices: first, we focus on the OneCommon Corpus (CITATION), a simple yet challenging common grounding dataset which contains minimal bias by design. Second, we analyze their…

Maintaining Common Ground in Dynamic Environments

This work proposes a novel task setting to study the ability of both creating and maintaining common ground in dynamic environments, and collects a large-scale dataset of 5,617 dialogues to enable fine-grained evaluation and analysis of various dialogue systems.

Reference-Centric Models for Grounded Collaborative Dialogue

This work presents a grounded neural dialogue model that successfully collaborates with people in a partially-observable reference game, in which two agents each observe an overlapping part of a world context and need to identify and agree on some object they share.

Pragmatics in Grounded Language Learning: Phenomena, Tasks, and Modeling Approaches

People rely heavily on context to enrich meaning beyond what is literally said, enabling concise but effective communication. To interact successfully and naturally with people, user-facing artificial

SPARTQA: A Textual Question Answering Benchmark for Spatial Reasoning

Experiments show that further pretraining LMs on these automatically generated data significantly improves LMs’ capability on spatial understanding, which in turn helps to better solve two external datasets, bAbI and boolQ.

Grounding ‘Grounding’ in NLP

This work investigates the gap between definitions of “grounding” in NLP and Cognitive Science, and presents ways to either create new tasks or repurpose existing ones to make advancements towards achieving a more complete sense of grounding.

A Meta-framework for Spatiotemporal Quantity Extraction from Text

This paper formulates the NLP problem of spatiotemporal quantity extraction, and proposes the first meta-framework for solving it, which contains a formalism that decomposes the problem into several information extraction tasks, a shareable crowdsourcing pipeline, and transformer-based baseline models.

Learning Grounded Pragmatic Communication

The PhotoBook Dataset: Building Common Ground through Visually-Grounded Dialogue

A baseline model for reference resolution is proposed which uses a simple method to take into account shared information accumulated in a reference chain. The results show that this information is particularly important for resolving later descriptions, and underline the need to develop more sophisticated models of common ground in dialogue interaction.

An Annotated Corpus of Reference Resolution for Interpreting Common Grounding

This work considers reference resolution as the central subtask of common grounding and proposes a new resource to study its intermediate process, and demonstrates the advantages of the annotation for interpreting, analyzing and improving common grounding in baseline dialogue systems.

What You See is What You Get: Visual Pronoun Coreference Resolution in Dialogues

This paper formally defines the task of visual-aware pronoun coreference resolution (PCR), introduces VisPro, a large-scale dialogue PCR dataset, proposes a novel visual-aware PCR model, VisCoref, for this task, and conducts comprehensive experiments and case studies on the dataset.

Colors in Context: A Pragmatic Neural Model for Grounded Language Understanding

We present a model of pragmatic referring expression interpretation in a grounded communication task (identifying colors from descriptions) that draws upon predictions from two recurrent neural

Visual Dialogue without Vision or Dialogue

An embarrassingly simple method based on Canonical Correlation Analysis (CCA) that, on the standard dataset, achieves near state-of-the-art performance on mean rank (MR).

History for Visual Dialog: Do we really need it?

It is shown that co-attention models which explicitly encode dialogue history outperform models that don’t, achieving state-of-the-art performance, and a challenging subset of the VisdialVal set is provided along with a benchmark NDCG of 63%.

A Natural Language Corpus of Common Grounding under Continuous and Partially-Observable Context

This paper proposes a minimal dialogue task which requires advanced skills of common grounding under continuous and partially-observable context, and collects a large-scale dataset of 6,760 dialogues which fulfills essential requirements of natural language corpora.

Talk the Walk: Navigating New York City through Grounded Dialogue

This work focuses on the task of tourist localization and develops the novel Masked Attention for Spatial Convolutions (MASC) mechanism that allows for grounding tourist utterances into the guide's map, and shows it yields significant improvements for both emergent and natural language communication.

Modality-Balanced Models for Visual Dialogue

The Visual Dialog task requires a model to exploit both image and conversational context information to generate the next response to the dialogue. However, via manual analysis, we find that a large

Visual Referring Expression Recognition: What Do Systems Actually Learn?

We present an empirical analysis of state-of-the-art systems for referring expression recognition – the task of identifying the object in an image referred to by a natural language expression – with