A Dataset for Document Grounded Conversations

@inproceedings{Zhou2018ADF,
  title={A Dataset for Document Grounded Conversations},
  author={Kangyan Zhou and Shrimai Prabhumoye and Alan W. Black},
  booktitle={EMNLP},
  year={2018}
}
This paper introduces a document grounded dataset for text conversations. [...] Key Method: We describe two neural architectures that provide benchmark performance on the task of generating the next response. We also evaluate our models for engagement and fluency, and find that the information from the document helps in generating more engaging and fluent responses.

Citations
Incremental Transformer with Deliberation Decoder for Document Grounded Conversations
TLDR: This paper designs an Incremental Transformer to encode multi-turn utterances along with knowledge in related documents, and a two-pass decoder (Deliberation Decoder) to improve context coherence and knowledge correctness.
NaturalConv: A Chinese Dialogue Dataset Towards Multi-turn Topic-driven Conversation
TLDR: NaturalConv, a Chinese multi-turn topic-driven conversation dataset, allows participants to chat about anything they want as long as some element of the topic is mentioned and the topic shift is smooth; it should be a good benchmark for further research evaluating the validity and naturalness of multi-turn conversation systems.
Focused Attention Improves Document-Grounded Generation
TLDR: This work introduces two novel adaptations of large-scale pre-trained encoder-decoder models, focusing on building a context-driven representation of the document and enabling specific attention to the information in the document.
A Compare Aggregate Transformer for Understanding Document-grounded Dialogue
TLDR: A Compare Aggregate Transformer (CAT) is proposed to jointly denoise the dialogue context and aggregate the document information for response generation, along with two metrics for evaluating document utilization efficiency based on word overlap.
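As a purely illustrative sketch (the paper's exact metrics are not given here), a word-overlap utilization score could measure the fraction of response tokens drawn from the document rather than the dialogue context; the function name and definition below are assumptions:

```python
def document_utilization(response_tokens, document_tokens, context_tokens):
    """Hypothetical utilization score: fraction of response tokens that
    appear in the grounding document but not in the dialogue context."""
    doc_only = set(document_tokens) - set(context_tokens)
    if not response_tokens:
        return 0.0
    return sum(t in doc_only for t in response_tokens) / len(response_tokens)
```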
Topical-Chat: Towards Knowledge-Grounded Open-Domain Conversations
TLDR: Topical-Chat is introduced, a knowledge-grounded human-human conversation dataset where the underlying knowledge spans 8 broad topics and conversation partners don't have explicitly defined roles, to help further research in open-domain conversational AI.
A Corpus of Controlled Opinionated and Knowledgeable Movie Discussions for Training Neural Conversation Models
TLDR: This work introduces a new labeled dialogue dataset in the domain of movie discussions, where every dialogue is based on pre-specified facts and opinions, and an end-to-end trained self-attention decoder baseline that generates opinionated responses judged to be natural, knowledgeable, and attentive.
Content Selection Network for Document-grounded Retrieval-based Chatbots
TLDR: A document content selection network (CSN) is proposed to perform explicit selection of relevant document contents and filter out the irrelevant parts; it produces better results than the state-of-the-art approaches.
DIALKI: Knowledge Identification in Conversational Systems through Dialogue-Document Contextualization
Identifying relevant knowledge to be used in conversational systems that are grounded in long documents is critical to effective response generation. We introduce a knowledge identification model [...]
Introducing MANtIS: a novel Multi-Domain Information Seeking Dialogues Dataset
Conversational search is an approach to information retrieval (IR), where users engage in a dialogue with an agent in order to satisfy their information needs. Previous conceptual work described [...]
On Incorporating Structural Information to improve Dialogue Response Generation
TLDR: This work proposes a new architecture that uses the ability of BERT to capture deep contextualized representations in conjunction with explicit structure and sequence information, and a plug-and-play Semantics-Sequences-Structures (SSS) framework that effectively combines such linguistic information.

References

Showing 1-10 of 15 references
The Ubuntu Dialogue Corpus: A Large Dataset for Research in Unstructured Multi-Turn Dialogue Systems
This paper introduces the Ubuntu Dialogue Corpus, a dataset containing almost 1 million multi-turn dialogues, with a total of over 7 million utterances and 100 million words. This provides a unique [...]
A Survey of Available Corpora for Building Data-Driven Dialogue Systems
TLDR: A wide survey of publicly available datasets suitable for data-driven learning of dialogue systems is carried out; important characteristics of these datasets are discussed, along with how they can be used to learn diverse dialogue strategies.
Frames: a corpus for adding memory to goal-oriented dialogue systems
TLDR: The frame tracking task, which consists of keeping track of different semantic frames throughout each dialogue, is proposed, along with a rule-based baseline through which the task is analysed.
Chameleons in Imagined Conversations: A New Approach to Understanding Coordination of Linguistic Style in Dialogs
TLDR: It is argued that fictional dialogs offer a way to study how authors create conversations but don't receive the social benefits (rather, the imagined characters do), and significant coordination across many families of function words is found in the large movie-script corpus.
Get To The Point: Summarization with Pointer-Generator Networks
TLDR: A novel architecture that augments the standard sequence-to-sequence attentional model in two orthogonal ways, using a hybrid pointer-generator network that can copy words from the source text via pointing, which aids accurate reproduction of information, while retaining the ability to produce novel words through the generator.
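As an illustration (not code from the paper), the pointer-generator mixture described above can be sketched as a weighted sum of the generator's vocabulary distribution and the copy distribution induced by attention over source tokens; the function name and array shapes below are assumptions:

```python
import numpy as np

def pointer_generator_dist(vocab_dist, attention, source_ids, p_gen):
    """Mix a generation distribution with a copy distribution.

    vocab_dist: softmax over the fixed vocabulary, shape (vocab_size,)
    attention:  attention weights over source tokens, shape (src_len,)
    source_ids: vocabulary id of each source token, shape (src_len,)
    p_gen:      scalar in [0, 1], probability of generating vs. copying
    """
    final = p_gen * vocab_dist
    # Scatter-add copy probability mass onto the ids of the source tokens,
    # so out-of-context words keep only their generator probability.
    np.add.at(final, source_ids, (1.0 - p_gen) * attention)
    return final
```

Because both inputs are distributions and the weights sum to one, the mixture is itself a valid distribution over the vocabulary.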
Sequence to Sequence Learning with Neural Networks
TLDR: This paper presents a general end-to-end approach to sequence learning that makes minimal assumptions on the sequence structure, and finds that reversing the order of the words in all source sentences improved the LSTM's performance markedly, because doing so introduced many short-term dependencies between the source and the target sentence which made the optimization problem easier.
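The source-reversal trick mentioned in this TLDR is a pure preprocessing step and can be sketched in a few lines (a minimal illustration, not the paper's code; the data layout is an assumption):

```python
def reverse_source(pairs):
    """Reverse each source sentence, leaving targets untouched.

    After reversal the first source word sits closest to the first
    target word, creating the short-term dependencies that the paper
    found ease optimization of the LSTM.
    """
    return [(list(reversed(src)), tgt) for src, tgt in pairs]

# reverse_source([(["a", "b", "c"], ["x", "y"])])
# -> [(["c", "b", "a"], ["x", "y"])]
```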
Personalizing Dialogue Agents: I have a dog, do you have pets too?
TLDR: This work collects data and trains models to condition on their given profile information and on information about the person they are talking to, resulting in improved dialogues as measured by next-utterance prediction.
Large Scale Evaluation of Corpus-Based Synthesizers: Results and Lessons from the Blizzard Challenge 2005
TLDR: The Blizzard Challenge 2005 was a large-scale international evaluation of various corpus-based speech synthesis systems using common datasets, the first ever to compare voices built by different systems using the same data.
Effective Approaches to Attention-based Neural Machine Translation
TLDR: A global approach which always attends to all source words and a local one that only looks at a subset of source words at a time are examined, demonstrating the effectiveness of both approaches on the WMT translation tasks between English and German in both directions.
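The "global" variant above can be sketched as scoring every encoder state against the current decoder state and taking a softmax-weighted sum (shown here with the simple dot score; the local variant would restrict scoring to a window). A minimal sketch, with function names and shapes as assumptions:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())  # shift for numerical stability
    return e / e.sum()

def global_attention(h_t, encoder_states):
    """Global attention with the dot score.

    h_t:            decoder hidden state at the current step, shape (d,)
    encoder_states: all source hidden states, shape (src_len, d)
    Returns the context vector: a weighted sum over ALL source states.
    """
    scores = encoder_states @ h_t      # dot score for every source position
    weights = softmax(scores)          # attention distribution over the source
    return weights @ encoder_states    # context vector, shape (d,)
```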
OpenSubtitles2016: Extracting Large Parallel Corpora from Movie and TV Subtitles
TLDR: A new major release of the OpenSubtitles collection of parallel corpora, compiled from a large database of movie and TV subtitles and including a total of 1689 bitexts spanning 2.6 billion sentences across 60 languages.