Shades of BLEU, Flavours of Success: The Case of MultiWOZ

Tomáš Nekvinda and Ondřej Dušek
The MultiWOZ dataset (Budzianowski et al., 2018) is frequently used for benchmarking context-to-response abilities of task-oriented dialogue systems. In this work, we identify inconsistencies in data preprocessing and reporting of three corpus-based metrics used on this dataset, i.e., BLEU score and Inform & Success rates. We point out a few problems of the MultiWOZ benchmark such as unsatisfactory preprocessing, insufficient or underspecified evaluation metrics, or rigid database. We re-evaluate 7…
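To make the preprocessing sensitivity concrete, here is a minimal corpus-level BLEU sketch in plain Python (uniform 4-gram weights, single reference, no smoothing; a simplified illustration, not the MultiWOZ evaluation scripts themselves). A casing choice alone shifts the score on otherwise identical output:

```python
import math
from collections import Counter

def _ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hyps, refs, max_n=4, lowercase=False):
    """Corpus-level BLEU with uniform n-gram weights and a single reference."""
    if lowercase:
        hyps, refs = [h.lower() for h in hyps], [r.lower() for r in refs]
    hyp_toks = [h.split() for h in hyps]
    ref_toks = [r.split() for r in refs]
    precisions = []
    for n in range(1, max_n + 1):
        match = total = 0
        for h, r in zip(hyp_toks, ref_toks):
            hc, rc = _ngrams(h, n), _ngrams(r, n)
            match += sum(min(c, rc[g]) for g, c in hc.items())  # clipped counts
            total += sum(hc.values())
        if total == 0 or match == 0:
            return 0.0  # no smoothing: any zero precision zeroes the score
        precisions.append(match / total)
    hyp_len = sum(len(t) for t in hyp_toks)
    ref_len = sum(len(t) for t in ref_toks)
    # brevity penalty
    bp = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

hyp = ["There are 4 cheap restaurants in the centre ."]
ref = ["there are 4 cheap restaurants in the centre ."]
print(corpus_bleu(hyp, ref))                  # casing mismatch: ~86.3
print(corpus_bleu(hyp, ref, lowercase=True))  # identical after lowercasing: 100.0
```

The same effect applies to tokenization, delexicalization, and reference normalization; any undocumented choice of this kind makes scores from different papers incomparable, which is the paper's core complaint.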


Revisiting Markovian Generative Architectures for Efficient Task-Oriented Dialog Systems

This paper proposes to revisit Markovian Generative Architectures (MGA), which have been used in previous LSTM-based TOD systems but not studied for PLM-based systems, and shows the efficiency advantages of the proposed Markovian PLM-based systems over their non-Markovian counterparts, in both supervised and semi-supervised settings.

Building Markovian Generative Architectures Over Pretrained LM Backbones for Efficient Task-Oriented Dialog Systems

Experiments on MultiWOZ 2.1 show that in the rich-resource setting, the proposed Markov models reduce memory and time costs without performance degradation; in the low-resource setting, the training efficiency gain of the Markov models is more significant.

UBARv2: Towards Mitigating Exposure Bias in Task-Oriented Dialogs

This paper proposes session-level sampling which explicitly exposes the model to sampled generated content of dialog context during training, and employs a dropout-based consistency regularization with the masking strategy R-Mask to further improve the robustness and performance of the model.

Jointly Reinforced User Simulator and Task-oriented Dialog System with Simplified Generative Architecture

This paper proposes Simplified Generative Architectures (SGA) for DS and US respectively, both based on GPT-2 but using shortened history, and develops Jointly Reinforced US and DS, called SGA-JRUD, which achieves state-of-the-art performance on MultiWOZ2.1.

AARGH! End-to-end Retrieval-Generation for Task-Oriented Dialog

We introduce AARGH, an end-to-end task-oriented dialog system combining retrieval and generative approaches in a single model, aiming at improving dialog management and lexical diversity of outputs.

KRLS: Improving End-to-End Response Generation in Task Oriented Dialog with Reinforced Keywords Learning

A new training algorithm, KRLS, is proposed that utilizes reinforcement learning but avoids time-consuming auto-regressive generation, together with a fine-grained per-token reward function to help the model learn keyword generation more robustly.

BORT: Back and Denoising Reconstruction for End-to-End Task-Oriented Dialog

BORT achieves state-of-the-art capabilities in zero-shot domain scenarios and in low-resource scenarios, and enhances the anti-noise capability of the model.

Joint Learning of Practical Dialogue Systems and User Simulators

UBAR-E, an E2E TOD system that extends the influential UBAR model, is developed; it works on unseen dialogues by using inferred turn domains instead of ground-truth turn domains, and produces performance similar to UBAR in both model score and a range of lexical richness metrics on the MultiWOZ dataset.

Mars: Semantic-aware Contrastive Learning for End-to-End Task-Oriented Dialog

It is argued that enhancing the modeling of the relationship between the dialog context and the dialog/action state improves the quality of the predicted dialog state and action state, which in turn improves the quality of the generated response.

Dialogue Evaluation with Offline Reinforcement Learning

This paper shows that offline RL critics can be trained for any dialogue system as external evaluators, allowing dialogue performance comparisons across various types of systems, and has the benefit of being corpus- and model-independent, while attaining strong correlation with human judgements.



BLEU Might Be Guilty but References Are Not Innocent

This paper develops a paraphrasing task for linguists to perform on existing reference translations, which counteracts this bias and reveals that multi-reference BLEU does not improve correlation for high-quality output, and presents an alternative multi-reference formulation that is more effective.

Why We Need New Evaluation Metrics for NLG

A wide range of metrics are investigated, including state-of-the-art word-based and novel grammar-based ones, and it is demonstrated that they only weakly reflect human judgements of system outputs as generated by data-driven, end-to-end NLG.

A Call for Clarity in Reporting BLEU Scores

Pointing to the success of the parsing community, it is suggested that machine translation researchers settle upon a common BLEU scheme that does not allow for user-supplied reference processing, and a new tool, SacreBLEU, is provided to facilitate this.

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

This systematic study compares pre-training objectives, architectures, unlabeled datasets, transfer approaches, and other factors on dozens of language understanding tasks and achieves state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more.

MultiWOZ - A Large-Scale Multi-Domain Wizard-of-Oz Dataset for Task-Oriented Dialogue Modelling

The Multi-Domain Wizard-of-Oz dataset (MultiWOZ) is introduced: a fully labeled collection of human-human written conversations spanning multiple domains and topics, with 10k dialogues, at least one order of magnitude larger than all previous annotated task-oriented corpora.

MultiWOZ 2.1: A Consolidated Multi-Domain Dialogue Dataset with State Corrections and State Tracking Baselines

This work uses crowdsourced workers to re-annotate state and utterances based on the original utterances in the dataset, and benchmark a number of state-of-the-art dialogue state tracking models on the MultiWOZ 2.1 dataset and show the joint state tracking performance on the corrected state annotations.

Tangled up in BLEU: Reevaluating the Evaluation of Automatic Machine Translation Evaluation Metrics

This work develops a method for thresholding performance improvement under an automatic metric against human judgements, which allows quantification of type I versus type II errors incurred and suggests improvements to the protocols for metric evaluation and system performance evaluation in machine translation.

deltaBLEU: A Discriminative Metric for Generation Tasks with Intrinsically Diverse Targets

In tasks involving generation of conversational responses, ΔBLEU correlates reasonably with human judgments and outperforms sentence-level and IBM BLEU in terms of both Spearman's ρ and Kendall's τ.
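The rank correlations used in such metric meta-evaluation are simple to compute. A minimal stdlib sketch (no tie handling, for simplicity; the metric scores and human judgments below are invented for illustration):

```python
def _ranks(values):
    # rank positions 1..n (assumes no ties, for simplicity)
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    for rank, i in enumerate(order, start=1):
        ranks[i] = float(rank)
    return ranks

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman_rho(x, y):
    # Spearman's rho = Pearson correlation of the rank vectors
    return pearson(_ranks(x), _ranks(y))

def kendall_tau(x, y):
    # Kendall's tau = (concordant - discordant pairs) / total pairs
    n, concordant, discordant = len(x), 0, 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

# hypothetical metric scores vs. human judgments for five outputs
metric = [0.31, 0.45, 0.12, 0.80, 0.55]
human = [3.0, 3.5, 2.0, 4.5, 4.0]
print(spearman_rho(metric, human))  # identical rankings: 1.0
print(kendall_tau(metric, human))   # all pairs concordant: 1.0
```

Both coefficients depend only on the ordering the metric induces, not on its absolute values, which is why they are the standard choice when comparing an automatic metric against human judgments.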

Few-shot Natural Language Generation for Task-Oriented Dialog

FewshotWOZ is presented, the first NLG benchmark to simulate the few-shot learning setting in task-oriented dialog systems, and the proposed SC-GPT model significantly outperforms existing methods, measured by various automatic metrics and human evaluations.