The statistical advantage of automatic NLG metrics at the system level

@inproceedings{Wei2021TheSA,
  title={The statistical advantage of automatic NLG metrics at the system level},
  author={Johnny Tian-Zheng Wei and Robin Jia},
  booktitle={ACL},
  year={2021}
}
Estimating the expected output quality of generation systems is central to NLG. This paper qualifies the notion that automatic metrics are not as good as humans in estimating system-level quality. Statistically, humans are unbiased, high variance estimators, while metrics are biased, low variance estimators. We compare these estimators by their error in pairwise prediction (which generation system is better?) using the bootstrap. Measuring this error is complicated: predictions are evaluated… 
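
The following is a rough, self-contained sketch of the kind of comparison the abstract describes, not the authors' exact protocol: a human-like unbiased, high-variance estimator versus a metric-like biased, low-variance estimator of which of two systems is better, with a bootstrap over test items used to estimate each estimator's pairwise prediction error. All scores are synthetic and pairwise_error is a hypothetical helper; with these made-up numbers the metric's low variance typically outweighs its small bias, while enlarging the bias relative to the true quality gap flips the outcome.

import numpy as np

rng = np.random.default_rng(0)

# Synthetic per-segment "true" quality for two systems; system A is better by 0.03 on average.
n = 500
true_a = rng.normal(0.53, 0.10, n)
true_b = rng.normal(0.50, 0.10, n)

# Human judgments: unbiased but noisy (high variance around the true scores).
human_a = true_a + rng.normal(0.0, 0.30, n)
human_b = true_b + rng.normal(0.0, 0.30, n)

# Automatic metric: much less noise, but systematically biased against system A.
metric_a = true_a - 0.01 + rng.normal(0.0, 0.05, n)
metric_b = true_b + rng.normal(0.0, 0.05, n)

def pairwise_error(scores_a, scores_b, a_is_better=True, n_boot=2000):
    # Bootstrap over test items; count how often the estimator picks the wrong system.
    wrong = 0
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)
        says_a_wins = scores_a[idx].mean() > scores_b[idx].mean()
        wrong += says_a_wins != a_is_better
    return wrong / n_boot

print("human  pairwise error:", pairwise_error(human_a, human_b))
print("metric pairwise error:", pairwise_error(metric_a, metric_b))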

Citations

Re-Examining System-Level Correlations of Automatic Summarization Evaluation Metrics

TLDR
This work identifies two ways in which the definition of the system-level correlation is inconsistent with how metrics are used to evaluate systems in practice and proposes changes to rectify this disconnect.

Using Sampling to Estimate and Improve Performance of Automated Scoring Systems with Guarantees

TLDR
This work combines the existing paradigms by intelligently sampling which responses are scored by humans, proposes reward sampling, and observes significant gains in accuracy and quadratic weighted kappa (QWK) with a relatively small human budget.

Toward More Effective Human Evaluation for Machine Translation

TLDR
Using a sampling approach, this work demonstrates that information from document membership and automatic metrics can improve estimates over a pure random sampling baseline, achieving gains of up to 20% in average absolute error by leveraging stratified sampling and control variates.

Learning to Rank Visual Stories From Human Ranking Data

TLDR
This paper develops Vrank (VIST Ranker), a novel reference-free VIST metric for story evaluation, shows the superiority of Vrank through its generalizability to purely textual stories, and concludes that this reuse of human evaluation results puts Vrank in a strong position for continued future advances.

Question-Based Salient Span Selection for More Controllable Text Summarization

TLDR
A method is proposed for incorporating question-answering (QA) signals into a summarization model; it identifies salient noun phrases (NPs) in the input document by automatically generating wh-questions that are answered by the NPs and automatically determining whether those questions are answered in the gold summaries.

Repairing the Cracked Foundation: A Survey of Obstacles in Evaluation Practices for Generated Text

TLDR
This paper surveys the issues with human and automatic model evaluations, and with commonly used datasets in NLG, that have been pointed out over the past 20 years; it lays out a long-term vision for NLG evaluation and proposes concrete steps to improve evaluation processes.

References

SHOWING 1-10 OF 53 REFERENCES

The price of debiasing automatic metrics in natural language evaluation

TLDR
This paper uses control variates to combine automatic metrics with human evaluation to obtain an unbiased estimator with lower cost than human evaluation alone, but in practice this means only a 7-13% cost reduction on evaluating summarization and open-response question answering systems.
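
As a minimal sketch of the control-variates idea summarized above (assuming the goal is to estimate a system's mean human score; all data, noise levels, and names here are invented for illustration), an automatic metric whose mean is cheap to compute on the full test set can reduce the variance of a small human-evaluation sample without introducing bias:

import numpy as np

rng = np.random.default_rng(1)

# Full test set: metric scores are cheap, so they exist for every example.
N = 10_000
latent = rng.normal(0.6, 0.2, N)
metric_all = latent + rng.normal(0.0, 0.05, N)   # automatic metric
human_all = latent + rng.normal(0.0, 0.25, N)    # human judgments (expensive)

# We can only afford human judgments on a small random sample.
k = 200
idx = rng.choice(N, size=k, replace=False)
h, m = human_all[idx], metric_all[idx]

# Plain human-only estimate of system quality (unbiased, high variance).
plain = h.mean()

# Control-variate estimate: correct the human mean using the gap between
# the sampled metric mean and the metric mean on the full test set.
c = np.cov(h, m)[0, 1] / m.var(ddof=1)           # estimated optimal coefficient
cv = h.mean() - c * (m.mean() - metric_all.mean())

print(f"true mean human score : {human_all.mean():.4f}")
print(f"human-only estimate   : {plain:.4f}")
print(f"control-variate est.  : {cv:.4f}")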

Tangled up in BLEU: Reevaluating the Evaluation of Automatic Machine Translation Evaluation Metrics

TLDR
This work develops a method for thresholding performance improvement under an automatic metric against human judgements, which allows quantification of type I versus type II errors incurred and suggests improvements to the protocols for metric evaluation and system performance evaluation in machine translation.

Is all that Glitters in Machine Translation Quality Estimation really Gold?

TLDR
A range of quality estimation systems is evaluated using HTER and direct assessment (DA) of translation adequacy as gold labels, revealing a divergence in system rankings and motivating a proposal to employ DA for future quality estimation evaluations.

BLEU Might Be Guilty but References Are Not Innocent

TLDR
This paper develops a paraphrasing task for linguists to perform on existing reference translations, which counteracts the translationese bias of standard references; it reveals that multi-reference BLEU does not improve the correlation for high-quality output, and presents an alternative multi-reference formulation that is more effective.

Unifying Human and Statistical Evaluation for Natural Language Generation

TLDR
This paper proposes HUSE, a unified framework which evaluates both diversity and quality based on the optimal error rate of predicting whether a sentence is human- or machine-generated; this error rate is efficiently estimated by combining human and statistical evaluation.

Using PRMSE to evaluate automated scoring systems in the presence of label noise

TLDR
It is proposed that PRMSE, a new metric developed within the educational measurement community, can help address the issue of noisy labels when evaluating NLP systems, and practical guidelines on using PRMSE are provided.

Experts, Errors, and Context: A Large-Scale Study of Human Evaluation for Machine Translation

TLDR
This work proposes an evaluation methodology grounded in explicit error analysis, based on the Multidimensional Quality Metrics (MQM) framework, and carries out the largest MQM research study to date, scoring the outputs of top systems from the WMT 2020 shared task in two language pairs using annotations provided by professional translators with full document context.

Results of the WMT18 Metrics Shared Task: Both characters and embeddings achieve good performance

This paper presents the results of the WMT18 Metrics Shared Task. We asked participants of this task to score the outputs of the MT systems involved in the WMT18 News Translation Task with automatic metrics.

An Empirical Investigation of Statistical Significance in NLP

TLDR
Two aspects of the empirical behavior of paired significance tests for NLP systems are investigated: when one system appears to outperform another, and, once significance levels are computed, how well the standard i.i.d. notion of significance holds up in practical settings where future distributions are neither independent nor identically distributed.

With Little Power Comes Great Responsibility

TLDR
It is concluded that underpowered experiments are common in the NLP literature; an overview of best practices for power analysis in NLP is given, and a series of notebooks is released to assist with future power analyses (see the sketch after this reference list).
...
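
Relating to the power-analysis reference above ("With Little Power Comes Great Responsibility"), the following is a minimal simulation sketch of estimating the power of a paired comparison between two systems. It is not taken from the released notebooks; it assumes scipy is available, and the effect size and noise level are made up.

import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

def estimated_power(n_items, effect=0.01, noise=0.1, alpha=0.05, n_sims=2000):
    # Fraction of simulated experiments in which a paired t-test detects
    # a true per-item difference of `effect` with per-item noise `noise`.
    hits = 0
    for _ in range(n_sims):
        diffs = rng.normal(effect, noise, n_items)   # per-item score differences A - B
        _, p = stats.ttest_1samp(diffs, 0.0)
        hits += (p < alpha) and (diffs.mean() > 0)
    return hits / n_sims

for n in (100, 500, 2000):
    print(f"n={n:5d}  estimated power={estimated_power(n):.2f}")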