Shades of BLEU, Flavours of Success: The Case of MultiWOZ

Tomáš Nekvinda and Ondřej Dušek
The MultiWOZ dataset (Budzianowski et al., 2018) is frequently used for benchmarking context-to-response abilities of task-oriented dialogue systems. In this work, we identify inconsistencies in data preprocessing and reporting of three corpus-based metrics used on this dataset, i.e., BLEU score and Inform & Success rates. We point out a few problems of the MultiWOZ benchmark such as unsatisfactory preprocessing, insufficient or underspecified evaluation metrics, or rigid database. We re-evaluate 7…


