MENLI: Robust Evaluation Metrics from Natural Language Inference

@article{Chen2022MENLIRE,
  title={MENLI: Robust Evaluation Metrics from Natural Language Inference},
  author={Yanran Chen and Steffen Eger},
  journal={ArXiv},
  year={2022},
  volume={abs/2208.07316}
}
Recently proposed BERT-based evaluation metrics perform well on standard evaluation benchmarks but are vulnerable to adversarial attacks, e.g., relating to factuality errors. We argue that this stems (in part) from the fact that they are models of semantic similarity. In contrast, we develop evaluation metrics based on Natural Language Inference (NLI), which we deem a more appropriate modeling choice. We design a preference-based adversarial attack framework and show that our NLI-based metrics are…
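To make the NLI-based idea concrete, here is a minimal sketch that scores a generated hypothesis by the entailment probability an off-the-shelf MNLI model assigns to it given the reference. This is not the authors' MENLI implementation: it assumes the Hugging Face transformers library and the public roberta-large-mnli checkpoint, and it only conveys the basic entailment-as-quality idea, not how the paper aggregates or combines the NLI signal.

```python
# Minimal sketch of an NLI-based quality signal (not the authors' MENLI code).
# Assumes: pip install torch transformers, and the public roberta-large-mnli checkpoint.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "roberta-large-mnli"  # off-the-shelf MNLI model; MENLI's exact setup may differ
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
model.eval()

def entailment_score(premise: str, hypothesis: str) -> float:
    """P(entailment) of the hypothesis given the premise, used as a rough quality score."""
    inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = torch.softmax(model(**inputs).logits, dim=-1)[0]
    # Look up the entailment index from the model config rather than hardcoding it.
    label2id = {label.lower(): idx for idx, label in model.config.id2label.items()}
    return probs[label2id["entailment"]].item()

reference = "The new law was passed in 2019."
print(entailment_score(reference, "The law passed in 2019."))  # expected to be high
print(entailment_score(reference, "The law passed in 2009."))  # factuality error, expected lower
```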

Layer or Representation Space: What Makes BERT-based Evaluation Metrics Robust?

It is shown that an embedding-based metric that has the highest correlation with human evaluations on a standard benchmark can have the lowest correlation if the amount of input noise or unknown tokens increases, and that the highest robustness is achieved when using character-level embeddings, instead of token-based embeddings, from the first layer of the pretrained model.

Reproducibility Issues for BERT-based Evaluation Metrics

This paper asks whether results and claims from four recent BERT-based metrics can be reproduced and finds that reproduction often fails because of heavy undocumented preprocessing in the metrics, missing code, and weaker reported results for the baseline metrics.

Can we do that simpler? Simple, Efficient, High-Quality Evaluation Metrics for NLG

TinyBERT shows the best quality-efficiency tradeoff for semantic similarity metrics of the BERTScore family, retaining 97% of the quality while being 5x faster at inference time on average; speed-ups differ substantially on CPU vs. GPU.
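As a rough illustration of such a backbone swap (not the paper's exact setup), the public bert-score package accepts a custom model via its model_type and num_layers arguments; the TinyBERT checkpoint name below is an assumed choice from the Hugging Face hub.

```python
# Sketch: trading quality for speed by swapping in a smaller backbone for BERTScore.
# Assumes pip install bert-score; the checkpoint name is an illustrative choice from the HF hub.
from bert_score import score

candidates = ["The quick brown fox jumps over the lazy dog."]
references = ["A fast brown fox leaps over a lazy dog."]

# Custom backbones need num_layers to pick which hidden layer provides the embeddings.
P, R, F1 = score(
    candidates,
    references,
    model_type="huawei-noah/TinyBERT_General_4L_312D",  # 4-layer distilled BERT
    num_layers=4,
)
print(f"TinyBERT-backed BERTScore F1: {F1[0].item():.4f}")
```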

USCORE: An Effective Approach to Fully Unsupervised Evaluation Metrics for Machine Translation

This work develops fully unsupervised evaluation metrics that beat supervised competitors on four out of four evaluation datasets and leverages similarities and synergies between evaluation metric induction, parallel corpus mining, and MT systems.

References

Showing 1-10 of 65 references

Global Explainability of BERT-Based Evaluation Metrics by Disentangling along Linguistic Factors

This work uses a simple regression-based global explainability technique to disentangle metric scores along linguistic factors, including semantics, syntax, morphology, and lexical overlap, and shows that the different metrics capture all aspects to some degree, but that they are all substantially sensitive to lexical overlap.

Are Factuality Checkers Reliable? Adversarial Meta-evaluation of Factuality in Summarization

An adversarial meta-evaluation methodology is presented that diagnoses the strengths and weaknesses of 6 existing top-performing factuality metrics over 24 diagnostic test datasets and searches for directions for further improvement via data augmentation.

BERTScore: Evaluating Text Generation with BERT

This work proposes BERTScore, an automatic evaluation metric for text generation that correlates better with human judgments and provides stronger model selection performance than existing metrics.
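For context, a minimal usage sketch of the publicly released bert-score package (the example inputs are made up; the package's score function returns precision, recall, and F1 per candidate/reference pair):

```python
# Minimal usage sketch of the public bert-score package (pip install bert-score).
from bert_score import score

candidates = ["The cat sat on the mat."]
references = ["A cat was sitting on the mat."]

# Returns precision, recall, and F1 tensors with one entry per candidate/reference pair.
P, R, F1 = score(candidates, references, lang="en", verbose=False)
print(f"BERTScore F1: {F1[0].item():.4f}")
```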

Adversarial NLI: A New Benchmark for Natural Language Understanding

This work introduces a new large-scale NLI benchmark dataset, collected via an iterative, adversarial human-and-model-in-the-loop procedure, and shows that non-expert annotators are successful at finding the models' weaknesses.

USCORE: An Effective Approach to Fully Unsupervised Evaluation Metrics for Machine Translation

This work develops fully unsupervised evaluation metrics that beat supervised competitors on four out of four evaluation datasets and leverages similarities and synergies between evaluation metric induction, parallel corpus mining, and MT systems.

Towards Question-Answering as an Automatic Metric for Evaluating the Content Quality of a Summary

This work proposes a metric for evaluating the content quality of a summary using question answering (QA), identifies its performance bottlenecks, and estimates that its potential upper-bound performance surpasses that of all other automatic metrics, approaching the gold-standard Pyramid Method.

Ranking Generated Summaries by Correctness: An Interesting but Challenging Application for Natural Language Inference

This paper evaluates summaries produced by state-of-the-art models via crowdsourcing and shows that factual correctness errors occur frequently, in particular with more abstractive models, which leads to an interesting downstream application for entailment models.

RoMe: A Robust Metric for Evaluating Natural Language Generation

This paper proposes RoMe, an automatic evaluation metric that uses a self-supervised neural network trained on language features such as semantic similarity combined with tree edit distance and grammatical acceptability to assess the overall quality of a generated sentence.

Experts, Errors, and Context: A Large-Scale Study of Human Evaluation for Machine Translation

This work proposes an evaluation methodology grounded in explicit error analysis, based on the Multidimensional Quality Metrics (MQM) framework, and carries out the largest MQM research study to date, scoring the outputs of top systems from the WMT 2020 shared task in two language pairs using annotations provided by professional translators with full document context.

Mind the Trade-off: Debiasing NLU Models without Degrading the In-distribution Performance

A novel debiasing method, called confidence regularization, is introduced, which discourages models from exploiting biases while enabling them to receive enough incentive to learn from all the training examples, improving performance on out-of-distribution datasets while maintaining the original in-distribution accuracy.
...