Not All Errors are Equal: Learning Text Generation Metrics using Stratified Error Synthesis

Wenda Xu, Yi-Lin Tuan, Yujie Lu, Michael Stephen Saxon, Lei Li, William Yang Wang

Is it possible to build a general and automatic natural language generation (NLG) evaluation metric? Existing learned metrics either perform unsatisfactorily or are restricted to tasks where large human rating data is already available. We introduce SEScore, a model-based metric that is highly correlated with human judgements without requiring human annotation, by utilizing a novel, iterative error synthesis and severity scoring pipeline. This pipeline applies a series of plausible errors to…
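The core idea of the pipeline can be illustrated with a minimal sketch, assuming hypothetical perturbation operations and severity weights (the actual paper uses model-based error synthesis and a learned severity scoring scheme, not these toy values):

```python
import random

# Hypothetical severity weights per error type; the paper's point is that
# "not all errors are equal", so each synthetic error contributes a
# different penalty to the pseudo human-judgement score.
SEVERITY = {"drop_word": -1.0, "swap_words": -0.5}

def synthesize_errors(reference, num_errors, seed=0):
    """Apply a series of plausible errors to a reference sentence and
    accumulate a pseudo-score from the severities of the applied errors."""
    rng = random.Random(seed)
    tokens = reference.split()
    score = 0.0
    for _ in range(num_errors):
        err = rng.choice(list(SEVERITY))
        if err == "drop_word" and len(tokens) > 1:
            tokens.pop(rng.randrange(len(tokens)))
        elif err == "swap_words" and len(tokens) > 1:
            i = rng.randrange(len(tokens) - 1)
            tokens[i], tokens[i + 1] = tokens[i + 1], tokens[i]
        score += SEVERITY[err]
    return " ".join(tokens), score

corrupted, pseudo_score = synthesize_errors("the cat sat on the mat", 2)
```

Each (corrupted text, pseudo-score) pair then serves as a synthetic training example for a regression-based metric, sidestepping the need for human ratings.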


SEScore2: Retrieval Augmented Pretraining for Text Generation Evaluation

The authors' unsupervised SEScore2 outperforms supervised metrics trained on the News human ratings in the TED domain, and even outperforms the SOTA supervised BLEURT on data-to-text, dialogue generation, and overall correlation.

Foveate, Attribute, and Rationalize: Towards Safe and Trustworthy AI

FARM is proposed, a novel framework that leverages external knowledge for trustworthy rationale generation in the context of safety: it foveates on missing knowledge in specific scenarios, retrieves this knowledge with attribution to trustworthy sources, and uses it both to classify the safety of the original text and to generate human-interpretable rationales.

Results of WMT22 Metrics Shared Task: Stop Using BLEU – Neural Metrics Are Better and More Robust

The results demonstrate the superiority of neural-based learned metrics, show again that overlap metrics such as BLEU, spBLEU, and chrF correlate poorly with human ratings, and reveal that neural-based metrics are remarkably robust across different domains and challenges.


Experts, Errors, and Context: A Large-Scale Study of Human Evaluation for Machine Translation

This work proposes an evaluation methodology grounded in explicit error analysis, based on the Multidimensional Quality Metrics (MQM) framework, and carries out the largest MQM research study to date, scoring the outputs of top systems from the WMT 2020 shared task in two language pairs using annotations provided by professional translators with full document context.

Perturbation CheckLists for Evaluating NLG Evaluation Metrics

This work designs templates that target a specific criterion and perturb the output so that quality is affected only along that criterion, and shows that existing evaluation metrics are not robust to even simple perturbations and disagree with the scores humans assign to the perturbed output.

BARTScore: Evaluating Generated Text as Text Generation

This work conceptualizes the evaluation of generated text as a text generation problem, modeled using pre-trained sequence-to-sequence models, and proposes a metric, BARTScore, with a number of variants that can be flexibly applied in an unsupervised fashion to evaluate text from different perspectives.
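The "evaluation as generation" idea can be sketched as follows: the score of a hypothesis is its average per-token log-probability under a seq2seq model conditioned on the reference. Here a toy lookup table stands in for the pretrained BART model, so the numbers are purely illustrative:

```python
def seq2seq_score(hypothesis_tokens, token_logprob):
    """BARTScore-style score: mean log-probability of generating the
    hypothesis, conditioned on the reference. `token_logprob` is a stub
    standing in for a pretrained seq2seq model's conditional logits."""
    logps = [token_logprob(tok) for tok in hypothesis_tokens]
    return sum(logps) / len(logps)

# Toy stand-in for model probabilities: plausible words are "easier".
toy = {"the": -0.1, "cat": -0.5, "sat": -0.7, "dog": -3.0}
score_good = seq2seq_score(["the", "cat", "sat"], lambda t: toy.get(t, -5.0))
score_bad = seq2seq_score(["the", "dog", "sat"], lambda t: toy.get(t, -5.0))
# A hypothesis the model finds more probable gets a higher (less negative) score.
```

Swapping the conditioning direction (reference given hypothesis, or source given hypothesis) yields the metric's different variants for precision-, recall-, and faithfulness-oriented evaluation.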

MoverScore: Text Generation Evaluating with Contextualized Embeddings and Earth Mover Distance

This paper investigates strategies to encode system and reference texts to devise a metric that shows a high correlation with human judgment of text quality and validate the new metric, namely MoverScore, on a number of text generation tasks.
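For uniformly weighted, equally sized point sets, earth mover distance reduces to a minimum-cost matching, which a brute-force sketch can compute for tiny inputs. MoverScore solves the general transport problem over contextualized BERT embeddings with idf weights; the 1-D "embeddings" below are only for illustration:

```python
from itertools import permutations

def emd_uniform(xs, ys, dist):
    """Earth mover distance between two equally sized, uniformly weighted
    point sets: the minimum average pairwise cost over all matchings.
    Brute force over permutations, so only suitable for tiny examples."""
    n = len(xs)
    best = min(
        sum(dist(xs[i], ys[p[i]]) for i in range(n))
        for p in permutations(range(n))
    )
    return best / n

# 1-D stand-ins for token embeddings of a system output and a reference.
a = [0.0, 1.0, 2.0]
b = [0.1, 1.1, 2.1]
score = emd_uniform(a, b, lambda x, y: abs(x - y))
```

A low transport cost means the system text's embeddings can be cheaply "moved" onto the reference's, which is the intuition behind the metric's correlation with perceived quality.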

Machine Translation Evaluation using Bi-directional Entailment

A new metric for machine translation evaluation, based on bi-directional entailment, uses a pre-trained BERT transformer fine-tuned on the MNLI corpus for natural language inference; the metric is found to correlate better with human-annotated scores than the other traditional metrics at the system level.

BLEURT: Learning Robust Metrics for Text Generation

BLEURT, a learned evaluation metric for English based on BERT, can model human judgment with a few thousand possibly biased training examples and yields superior results even when the training data is scarce and out-of-distribution.

COMET: A Neural Framework for MT Evaluation

This framework leverages recent breakthroughs in cross-lingual pretrained language modeling resulting in highly multilingual and adaptable MT evaluation models that exploit information from both the source input and a target-language reference translation in order to more accurately predict MT quality.

Automatic Machine Translation Evaluation in Many Languages via Zero-Shot Paraphrasing

This work proposes the use of a sequence-to-sequence paraphraser for automatic machine translation evaluation, and finds that the model conditioned on the source instead of the reference outperforms every quality-estimation-as-a-metric system from the WMT19 shared task on quality estimation by a statistically significant margin in every language pair.

A Study of Translation Edit Rate with Targeted Human Annotation

A new, intuitive measure for evaluating machine translation output is examined that avoids both the knowledge intensiveness of more meaning-based approaches and the labor intensiveness of human judgments; results indicate that HTER correlates with human judgments better than HMETEOR, and that the four-reference variants of TER and HTER correlate with human judgments as well as, or better than, a second human judgment does.
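The core of TER is the number of edits needed to turn the system output into the reference, normalized by reference length. A simplified sketch without the block-shift operation (which full TER also allows) is just word-level edit distance:

```python
def ter_no_shift(hypothesis, reference):
    """Simplified Translation Edit Rate: word-level edit distance
    (insertions, deletions, substitutions) divided by reference length.
    Full TER additionally allows block shifts, omitted here for brevity."""
    hyp, ref = hypothesis.split(), reference.split()
    # Standard dynamic-programming edit distance over words.
    d = [[0] * (len(ref) + 1) for _ in range(len(hyp) + 1)]
    for i in range(len(hyp) + 1):
        d[i][0] = i
    for j in range(len(ref) + 1):
        d[0][j] = j
    for i in range(1, len(hyp) + 1):
        for j in range(1, len(ref) + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[len(hyp)][len(ref)] / len(ref)

# One substitution against a four-word reference -> rate 0.25
rate = ter_no_shift("the cat sat down", "the dog sat down")
```

HTER, the human-targeted variant studied in the paper, computes the same rate against a human-post-edited version of the output rather than a fixed reference.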

RUSE: Regressor Using Sentence Embeddings for Automatic Machine Translation Evaluation

The RUSE metric, introduced for the WMT18 metrics shared task, uses a multi-layer perceptron regressor over three types of sentence embeddings to improve the automatic evaluation of machine translation.