Corpus ID: 237099299

On the Need for Thoughtful Data Collection for Multi-Party Dialogue: A Survey of Available Corpora and Collection Methods

Khyati Mahajan and Samira Shaikh
We present a comprehensive survey of available corpora for multi-party dialogue. We survey over 300 publications related to multi-party dialogue and catalogue all available corpora in a novel taxonomy. We analyze methods of data collection for multi-party dialogue corpora and identify several lacunae in existing data collection approaches used to collect such dialogue. We present this survey, the first survey to focus exclusively on multi-party dialogue corpora, to motivate research in this… 
1 Citation

Bazinga! A Dataset for Multi-Party Dialogues Structuring
A dataset built around a large collection of TV (and movie) series filled with challenging multi-party dialogues, Bazinga! is introduced, providing a baseline for speaker diarization, punctuation restoration, and person entity recognition.


A corpus for studying addressing behaviour in multi-party dialogues
A multi-modal corpus of hand-annotated meeting dialogues, designed for studying addressing behaviour in face-to-face conversations, is described, along with an analysis of the reproducibility and stability of the annotation scheme.
A Survey of Available Corpora for Building Data-Driven Dialogue Systems
A wide survey of publicly available datasets suitable for data-driven learning of dialogue systems is carried out, discussing important characteristics of these datasets and how they can be used to learn diverse dialogue strategies.
MPC: A Multi-Party Chat Corpus for Modeling Social Phenomena in Discourse
This effort is part of a larger project to develop computational models of social phenomena such as agenda control, influence, and leadership in on-line interactions, to help capture the dialogue dynamics that are essential for developing realistic human-machine dialogue systems, including autonomous virtual chat agents.
Survey on evaluation methods for dialogue systems
This paper distinguishes between the various classes of dialogue systems (task-oriented, conversational, and question-answering dialogue systems) by introducing the main technologies developed for each class and then presenting the evaluation methods for that class.
Extending the MPC corpus to Chinese and Urdu - A Multiparty Multi-Lingual Chat Corpus for Modeling Social Phenomena in Language
This work builds a multi-lingual multi-party online chat corpus in order to develop a firm understanding of a set of social constructs such as agenda control, influence, and leadership, as well as to computationally model such constructs in online interactions.
The Ubuntu Dialogue Corpus: A Large Dataset for Research in Unstructured Multi-Turn Dialogue Systems
This paper introduces the Ubuntu Dialogue Corpus, a dataset containing almost 1 million multi-turn dialogues, with a total of over 7 million utterances and 100 million words. This provides a unique…
MediaSum: A Large-scale Media Interview Dataset for Dialogue Summarization
This paper introduces MediaSum, a large-scale media interview dataset consisting of 463.6K transcripts with abstractive summaries that can be used in transfer learning to improve a model’s performance on other dialogue summarization tasks.
Automatic Construction of Discourse Corpora for Dialogue Translation
A novel approach is proposed to automatically construct a parallel discourse corpus for dialogue machine translation by mapping monolingual discourse to bilingual texts via an information retrieval approach and integrating speaker information into the translation.
Towards online speech summarization
A novel method for weighting dialogue acts using only very limited local context is introduced, and it is shown that high summary precision is possible even when information about the meeting as a whole is lacking.
Training End-to-End Dialogue Systems with the Ubuntu Dialogue Corpus
In this paper, we construct and train end-to-end neural network-based dialogue systems using an updated version of the recent Ubuntu Dialogue Corpus, a dataset containing almost 1 million multi-turn…