Distribution Aware Metrics for Conditional Natural Language Generation

David Chan, Yiming Ni, Austin Myers, Sudheendra Vijayanarasimhan, David A. Ross, John F. Canny
Traditional automated metrics for evaluating conditional natural language generation use pairwise comparisons between a single generated text and the best-matching gold-standard ground truth text. When multiple ground truths are available, scores are aggregated using an average or max operation across references. While this approach works well when diversity in the ground truth data (i.e. dispersion of the distribution of conditional texts) can be ascribed to noise, such as in automated speech… 
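The aggregation scheme described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the pairwise scorer `unigram_f1` is a toy stand-in (an assumption made here) for whichever pairwise metric is actually used, and the function names and the `mode` parameter are hypothetical.

```python
from collections import Counter
from statistics import mean


def unigram_f1(candidate: str, reference: str) -> float:
    """Toy pairwise score (assumed for illustration): F1 over shared word tokens."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def aggregate_score(candidate: str, references: list[str], mode: str = "max") -> float:
    """Score one generated text against several references, then aggregate.

    mode="max" keeps the best-matching reference; mode="mean" averages over all.
    """
    if not references:
        raise ValueError("at least one reference is required")
    scores = [unigram_f1(candidate, ref) for ref in references]
    if mode == "max":
        return max(scores)
    if mode == "mean":
        return mean(scores)
    raise ValueError(f"unknown mode: {mode!r}")


if __name__ == "__main__":
    refs = ["a dog runs", "the cat sleeps"]
    print(aggregate_score("a dog runs", refs, mode="max"))
    print(aggregate_score("a dog runs", refs, mode="mean"))
```

In both modes a single generated text is reduced to one number, so the spread of the reference set itself is never compared against the spread of the model's outputs.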

MAUVE Scores for Generative Models: Theory and Practice

This work presents MAUVE, a family of comparison measures between pairs of distributions such as those encountered in the generative modeling of text or images, and finds that the proposed scores, paired with a range of f-divergences and statistical estimation methods, can quantify the gaps between the distributions of human-written text and those of modern neural language models.

$IC^3$: Image Captioning by Committee Consensus

If you ask a human to describe an image, they might do so in a thousand different ways. Traditionally, image captioning models are trained to approximate the reference distribution of image captions,



What’s in a Caption? Dataset-Specific Linguistic Diversity and Its Effect on Visual Description Models and Metrics

This work examines several popular visual description datasets and the dataset-specific linguistic patterns that models exploit but that do not generalize to new domains, finding that caption diversity is a major driving factor behind the generation of generic and uninformative captions.

Curious Case of Language Generation Evaluation Metrics: A Cautionary Tale

These experiments show that metrics usually prefer system outputs to human-authored texts, can be insensitive to correct translations of rare words, and can yield surprisingly high scores when given a single sentence as system output for the entire test set.

Sentence Mover’s Similarity: Automatic Evaluation for Multi-Sentence Texts

This work introduces methods based on sentence mover’s similarity, and finds that sentence-based metrics correlate with human judgments significantly better than ROUGE, both on machine-generated summaries and human-authored essays.

SPICE: Semantic Propositional Image Caption Evaluation

There is considerable interest in the task of automatically generating image captions. However, evaluation is challenging. Existing automatic evaluation metrics are primarily sensitive to n-gram

LaMDA: Language Models for Dialog Applications

It is demonstrated that fine-tuning with annotated data and enabling the model to consult external knowledge sources can lead to significant improvements towards the two key challenges of safety and factual grounding.

CIDEr: Consensus-based image description evaluation

This work proposes a novel paradigm for evaluating image descriptions that uses human consensus, and evaluates a new automated metric that captures human judgment of consensus better than existing metrics across sentences generated by various sources.

CLIPScore: A Reference-free Evaluation Metric for Image Captioning

The surprising empirical finding that CLIP (Radford et al., 2021), a cross-modal model pretrained on 400M image+caption pairs from the web, can be used for robust automatic evaluation of image captioning without the need for references is reported.

METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments

This work describes METEOR, an automatic metric for machine translation evaluation that is based on a generalized concept of unigram matching between the machine-produced translation and human-produced reference translations, and that can be easily extended to include more advanced matching strategies.

BERTScore: Evaluating Text Generation with BERT

This work proposes BERTScore, an automatic evaluation metric for text generation that correlates better with human judgments and provides stronger model selection performance than existing metrics.

Learning Transferable Visual Models From Natural Language Supervision

It is demonstrated that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet.