Evaluating the Evaluation of Diversity in Natural Language Generation

Guy Tevet and Jonathan Berant
Despite growing interest in natural language generation (NLG) models that produce diverse outputs, there is currently no principled method for evaluating the diversity of an NLG system. In this work, we propose a framework for evaluating diversity metrics. The framework measures the correlation between a proposed diversity metric and a diversity parameter, a single parameter that controls some aspect of diversity in generated text. For example, a diversity parameter might be a binary variable… 
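The proposed check, correlating a candidate diversity metric with a controlled diversity parameter, can be sketched in a few lines. Here distinct-n stands in for the metric and a set of generations per parameter setting is assumed; both choices and the helper names are illustrative, not the paper's implementation:

```python
def distinct_n(texts, n=1):
    """Ratio of unique n-grams to total n-grams across a set of generations."""
    ngrams = [tuple(toks[i:i + n]) for t in texts
              for toks in [t.split()] for i in range(len(toks) - n + 1)]
    return len(set(ngrams)) / max(len(ngrams), 1)

def spearman(xs, ys):
    """Spearman rank correlation (assumes no ties), pure Python."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

# A metric that tracks the diversity parameter (e.g. temperature) across
# settings, as measured by rank correlation, passes the framework's test.
```

A high rank correlation between the metric's scores and the parameter values is what the framework treats as evidence that the metric actually measures that aspect of diversity.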

Decoding and Diversity in Machine Translation

This work characterizes distributional differences between generated and real translations, examining the cost in diversity paid for the BLEU scores enjoyed by NMT, and implicates search as a salient source of known bias when translating gender pronouns.

Are Some Words Worth More than Others?

This work proposes two new intrinsic evaluation measures within the framework of a simple word prediction task that are designed to give a more holistic picture of a language model's performance, and demonstrates that the approach reveals functional differences in performance between the models that are obscured by more traditional metrics.

GenAug: Data Augmentation for Finetuning Text Generators

This paper proposes and evaluates various augmentation methods, including some that incorporate external knowledge, for finetuning GPT-2 on a subset of Yelp Reviews, and examines the relationship between the amount of augmentation and the quality of the generated text.

Decoding Methods for Neural Narrative Generation

This work employs GPT-2, performs ablations across nucleus sampling thresholds and diverse decoding hyperparameters, and analyzes results over multiple criteria with automatic and human evaluation, finding that nucleus sampling is generally best with thresholds between 0.7 and 0.9, and that a maximum mutual information objective can improve the quality of generated stories.

MultiTalk: A Highly-Branching Dialog Testbed for Diverse Conversations

The culminating task is a challenging theory-of-mind problem: a controllable generation task that requires reasoning about the expected reaction of the listener. A simple scoring algorithm based on bipartite graph matching is also proposed to optimally incorporate a set of diverse references.

4W1H Keyword Extraction based Summarization Model

This paper proposes a new summarization method based on 4W1H keyword extraction, which extracts the answer to a question corresponding to each event in QA format, and applies the method to BERT and ELECTRA models to generate summaries.

Let Your Heart Speak in its Mother Tongue: Multilingual Captioning of Cardiac Signals

This work proposes a deep neural network capable of captioning cardiac signals and demonstrates that multilingual models can outperform their monolingual counterparts, informally terming this beneficial phenomenon the ‘blessing of multilinguality’.

DeepGen: Diverse Search Ad Generation and Real-Time Customization

DeepGen, a system deployed at web scale for automatically creating sponsored search advertisements (ads) for BingAds, leverages state-of-the-art natural language generation models to generate ads from advertisers’ web pages in an abstractive fashion and solves practical issues such as factuality and inference speed.

Diversifying Content Generation for Commonsense Reasoning with Mixture of Knowledge Graph Experts

This paper proposes MoKGE, a novel method that diversifies generative reasoning via a mixture-of-experts (MoE) strategy over commonsense knowledge graphs (KGs) to encourage varied generation outputs.

Exploring diversity in back translation for low-resource machine translation

The findings show that generating back translations using nucleus sampling results in higher final model performance, and that this method of generation yields high levels of both lexical and syntactic diversity.

Unifying Human and Statistical Evaluation for Natural Language Generation

This paper proposes a unified framework which evaluates both diversity and quality, based on the optimal error rate of predicting whether a sentence is human- or machine-generated, called HUSE, which is efficiently estimated by combining human and statistical evaluation.

Language GANs Falling Short

The impact of exposure bias on sample quality is less severe than previously thought, and temperature tuning provides a better quality / diversity trade-off than adversarial training while being easier to train, easier to cross-validate, and less computationally expensive.

Language Models are Unsupervised Multitask Learners

It is demonstrated that language models begin to learn these tasks without any explicit supervision when trained on a new dataset of millions of webpages called WebText, suggesting a promising path towards building language processing systems which learn to perform tasks from their naturally occurring demonstrations.

A Corpus and Cloze Evaluation for Deeper Understanding of Commonsense Stories

A new framework for evaluating story understanding and script learning: the ‘Story Cloze Test’, which requires a system to choose the correct ending to a four-sentence story, and a new corpus of 50k five-sentence commonsense stories, ROCStories, to enable this evaluation.

The Curious Case of Neural Text Degeneration

By sampling text from the dynamic nucleus of the probability distribution, which allows for diversity while effectively truncating the less reliable tail of the distribution, the resulting text better demonstrates the quality of human text, yielding enhanced diversity without sacrificing fluency and coherence.
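The dynamic-nucleus truncation described above can be illustrated over an explicit probability vector; this is a minimal sketch, not the paper's implementation, and the function name is hypothetical:

```python
import random

def nucleus_sample(probs, p=0.9, rng=random):
    """Sample an index from `probs`, restricted to the smallest set of
    top tokens whose cumulative probability reaches p (the 'nucleus')."""
    # Sort token indices by descending probability.
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    cum, nucleus = 0.0, []
    for i in order:
        nucleus.append(i)
        cum += probs[i]
        if cum >= p:  # stop once the nucleus covers mass >= p
            break
    # Renormalize within the nucleus and sample from it.
    total = sum(probs[i] for i in nucleus)
    r = rng.random() * total
    for i in nucleus:
        r -= probs[i]
        if r <= 0:
            return i
    return nucleus[-1]
```

Because the nucleus is recomputed per step, the number of candidate tokens grows or shrinks with the model's confidence, which is what truncates the unreliable tail without fixing the cutoff in advance.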

Diversity, Density, and Homogeneity: Quantitative Characteristic Metrics for Text Collections

Experiments on real-world datasets demonstrate that the proposed characteristic metrics are highly correlated with text classification performance of a renowned model, BERT, which could inspire future applications.

ELI5: Long Form Question Answering

This work introduces the first large-scale corpus for long form question answering, a task requiring elaborate and in-depth answers to open-ended questions, and shows that an abstractive model trained with a multi-task objective outperforms conventional Seq2Seq, language modeling, as well as a strong extractive baseline.

Evaluating Text GANs as Language Models

This work proposes to approximate the distribution of text generated by a GAN, which permits evaluating them with traditional probability-based LM metrics, and shows that they currently perform substantially worse than state-of-the-art LMs.

A Simple, Fast Diverse Decoding Algorithm for Neural Generation

A simple, fast decoding algorithm that fosters diversity in neural generation by adding an inter-sibling ranking penalty and is capable of automatically adjusting its diversity decoding rates for different inputs using reinforcement learning (RL).
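The inter-sibling ranking penalty can be shown in isolation. This sketch assumes each beam hypothesis's candidate expansions arrive as (token, log-probability) pairs; the function name and the penalty weight are illustrative:

```python
def rerank_siblings(candidates, gamma=1.0):
    """Given the (token, logprob) children of one beam hypothesis, subtract
    a penalty proportional to each sibling's rank (0 for the best child).
    This discourages the beam from expanding many children of the same
    parent, pushing diversity across hypotheses."""
    ranked = sorted(candidates, key=lambda c: -c[1])
    return [(tok, lp - gamma * rank) for rank, (tok, lp) in enumerate(ranked)]
```

At each decoding step the penalized scores, rather than the raw log-probabilities, are compared across all parents when selecting the next beam.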