Corpus ID: 207852924

An Annotation Scheme of A Large-scale Multi-party Dialogues Dataset for Discourse Parsing and Machine Comprehension

Jiaqi Li, Ming Liu, Bing Qin, Zihao Zheng, Ting Liu
In this paper, we propose a scheme for annotating large-scale multi-party chat dialogues for discourse parsing and machine comprehension. The main goal of this project is to help understand multi-party dialogues. Our dataset is based on the Ubuntu Chat Corpus. For each multi-party dialogue, we annotate the discourse structure and question-answer pairs. To the best of our knowledge, this is the first large-scale corpus for multi-party dialogue discourse parsing, and we are the first to propose the task…

Discourse parsing for multi-party chat dialogues
This paper presents what is, to the best of the authors' knowledge, the first discourse parser for multi-party chat dialogues, using the dependency parsing paradigm as in earlier work (Muller et al., 2012; Li et al., 2014).
The Ubuntu Dialogue Corpus: A Large Dataset for Research in Unstructured Multi-Turn Dialogue Systems
This paper introduces the Ubuntu Dialogue Corpus, a dataset containing almost 1 million multi-turn dialogues, with a total of over 7 million utterances and 100 million words. This provides a unique
A Deep Sequential Model for Discourse Parsing on Multi-Party Dialogues
A deep sequential model for parsing discourse dependency structures of multi-party dialogues by predicting dependency relations and constructing the discourse structure jointly and alternately is presented.
Discourse Structure and Dialogue Acts in Multiparty Dialogue: the STAC Corpus
The STAC resource, a corpus of multi-party chats annotated for discourse structure in the style of SDRT, is described. It is a rich source of data on strategic conversation and also the first corpus the authors are aware of that provides full discourse structures for multi-party dialogues.
GSN: A Graph-Structured Network for Multi-Party Dialogues
The core of GSN is a graph-based encoder that models the information flow along graph-structured dialogues (two-party sequential dialogues are a special case). Experimental results show that GSN significantly outperforms existing sequence-based models.
Keep Meeting Summaries on Topic: Abstractive Multi-Modal Meeting Summarization
An abstractive meeting summarizer that draws on both the video and audio of meeting recordings is developed; it significantly outperforms the state of the art on both BLEU and ROUGE measures.
CoQA: A Conversational Question Answering Challenge
CoQA, a novel dataset for building Conversational Question Answering systems, is introduced, and conversational questions are shown to exhibit challenging phenomena not present in existing reading comprehension datasets (e.g., coreference and pragmatic reasoning).
QuAC: Question Answering in Context
QuAC introduces challenges not found in existing machine comprehension datasets: its questions are often more open-ended, unanswerable, or only meaningful within the dialog context, as shown in a detailed qualitative evaluation.
The Penn Discourse TreeBank 2.0
We present the second version of the Penn Discourse Treebank, PDTB-2.0, describing its lexically-grounded annotations of discourse relations and their two abstract object arguments over the 1 million
WikiQA: A Challenge Dataset for Open-Domain Question Answering
The WIKIQA dataset is described, a new publicly available set of question and sentence pairs, collected and annotated for research on open-domain question answering, which is more than an order of magnitude larger than the previous dataset.