Can we do that simpler? Simple, Efficient, High-Quality Evaluation Metrics for NLG

@article{Grunwald2022CanWD,
  title={Can we do that simpler? Simple, Efficient, High-Quality Evaluation Metrics for NLG},
  author={Jens Grunwald and Christoph Leiter and Steffen Eger},
  journal={ArXiv},
  year={2022},
  volume={abs/2209.09593}
}
We explore efficient evaluation metrics for Natural Language Generation (NLG). To implement efficient metrics, we replace (i) computation-heavy transformers in metrics such as BERTScore, MoverScore, BARTScore, and XMoverScore with lighter versions (such as distilled ones) and (ii) cubic inference-time alignment algorithms such as Word Mover's Distance with linear and quadratic approximations. We consider six evaluation metrics (both monolingual and multilingual), assessed on three different…
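
As an illustration of idea (ii), here is a minimal sketch (not the authors' code) of the Relaxed Word Mover's Distance lower bound, which replaces the cubic-time optimal-transport solve with a quadratic-time greedy nearest-neighbor match; the random embeddings are placeholders for contextualized word vectors.

```python
import numpy as np

def relaxed_wmd(x, y):
    """Quadratic-time lower bound on Word Mover's Distance.

    x, y: (n, d) and (m, d) arrays of word embeddings for the two texts.
    Instead of solving the full optimal-transport problem, each word is
    greedily matched to its nearest neighbor in the other text; the
    symmetrized result lower-bounds the exact WMD.
    """
    # Pairwise Euclidean distances between all word pairs, shape (n, m).
    cost = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=-1)
    # Greedy one-directional matches, averaged over uniform word weights.
    d_xy = cost.min(axis=1).mean()  # each word in x -> closest word in y
    d_yx = cost.min(axis=0).mean()  # each word in y -> closest word in x
    return max(d_xy, d_yx)

# Toy example with random "embeddings" standing in for contextualized ones.
rng = np.random.default_rng(0)
hyp, ref = rng.normal(size=(5, 8)), rng.normal(size=(7, 8))
print(relaxed_wmd(hyp, ref))
```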

References

FrugalScore: Learning Cheaper, Lighter and Faster Evaluation Metrics for Automatic Text Generation

FrugalScore is proposed: an approach to learn a fixed, low-cost version of any expensive NLG metric that retains most of the original performance while having several orders of magnitude fewer parameters.
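
A minimal sketch (not FrugalScore's actual setup) of the general recipe: score text pairs with an expensive teacher metric, then regress a small student model onto those scores. The pair representations and teacher scores below are random placeholders.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Placeholders: pretend these are (hypothesis, reference) pair features
# and the scores an expensive teacher metric (e.g. BERTScore) assigned.
pair_feats = torch.randn(256, 32)
teacher_scores = torch.rand(256, 1)

# Tiny student regressor: learns to reproduce the teacher's scores cheaply.
student = nn.Sequential(nn.Linear(32, 16), nn.ReLU(), nn.Linear(16, 1))
opt = torch.optim.Adam(student.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for step in range(200):
    opt.zero_grad()
    loss = loss_fn(student(pair_feats), teacher_scores)
    loss.backward()
    opt.step()

# At inference time only the small student runs, not the expensive teacher.
```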

Searching for COMETINHO: The Little Metric That Could

This paper explores optimization techniques, pruning, and knowledge distillation to create more compact and faster COMET versions, and presents DISTIL-COMET, a lightweight distilled version that is 80% smaller and 2.128x faster while attaining performance close to the original model and above strong baselines such as BERTSCORE and PRISM.
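
One of the compression techniques mentioned, magnitude pruning, can be sketched with PyTorch's built-in pruning utilities; the linear layer below is just a stand-in for an encoder layer, not COMET's actual architecture.

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(768, 768)  # stand-in for one encoder layer
# Zero out the 30% of weights with the smallest magnitude (L1 criterion).
prune.l1_unstructured(layer, name="weight", amount=0.3)
# Make the pruning permanent by removing the reparametrization mask.
prune.remove(layer, "weight")
print(float((layer.weight == 0).float().mean()))  # ~0.3 sparsity
```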

Towards Explainable Evaluation Metrics for Natural Language Generation

This concept paper identifies key properties and proposes key goals of explainable machine translation evaluation metrics, and provides a vision of future approaches to explainable evaluation metrics and their evaluation.

USCORE: An Effective Approach to Fully Unsupervised Evaluation Metrics for Machine Translation

This work develops fully unsupervised evaluation metrics that beat supervised competitors on four out of four evaluation datasets and leverages similarities and synergies between evaluation metric induction, parallel corpus mining, and MT systems.

Knowledge Distillation for Quality Estimation

This work proposes to directly transfer knowledge from a strong QE teacher model to a much smaller model with a different, shallower architecture, and shows that this approach, in combination with data augmentation, leads to light-weight QE models that perform competitively with distilled pre-trained representations while having 8x fewer parameters.
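
A hedged sketch of the combination described: the teacher pseudo-labels a pool of unlabeled pairs (the data-augmentation step), and a much smaller, shallower student trains on gold plus pseudo-labeled data. All models and features here are random placeholders.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Stand-ins: features for labeled pairs and a larger unlabeled pool.
labeled_x, labeled_y = torch.randn(64, 32), torch.rand(64, 1)
unlabeled_x = torch.randn(512, 32)

teacher = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 1))
student = nn.Sequential(nn.Linear(32, 8), nn.ReLU(), nn.Linear(8, 1))  # shallower

# Data augmentation step: the teacher pseudo-labels the unlabeled pool.
with torch.no_grad():
    pseudo_y = teacher(unlabeled_x)

# The student trains on gold labels plus the teacher's pseudo-labels.
train_x = torch.cat([labeled_x, unlabeled_x])
train_y = torch.cat([labeled_y, pseudo_y])
opt = torch.optim.Adam(student.parameters(), lr=1e-3)
for _ in range(100):
    opt.zero_grad()
    nn.functional.mse_loss(student(train_x), train_y).backward()
    opt.step()
```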

XtremeDistil: Multi-stage Distillation for Massive Multilingual Models

This work studies knowledge distillation with a focus on multilingual Named Entity Recognition (NER) and proposes a stage-wise optimization scheme leveraging teacher internal representations that is agnostic of the teacher architecture, showing that it outperforms strategies employed in prior works.
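
A minimal sketch of distilling from teacher internal representations: the student's hidden states are projected into the teacher's space and matched to a chosen teacher layer with an MSE loss. Dimensions and layers here are illustrative, not XtremeDistil's actual configuration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch = torch.randn(8, 16, 128)  # stand-in token inputs (batch, tokens, dim)

teacher_layer = nn.Linear(128, 768)  # stand-in for a teacher hidden layer
student_layer = nn.Linear(128, 256)  # smaller student hidden layer
# Projection so the student's space can be compared with the teacher's.
proj = nn.Linear(256, 768)

with torch.no_grad():
    teacher_hidden = teacher_layer(batch)

student_hidden = student_layer(batch)
# One stage of a stage-wise scheme: match internal representations only;
# a later stage would switch to matching the teacher's output predictions.
repr_loss = nn.functional.mse_loss(proj(student_hidden), teacher_hidden)
repr_loss.backward()
```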

Perturbation CheckLists for Evaluating NLG Evaluation Metrics

This work designs templates that target a specific criterion and perturb the output such that its quality is affected only along that criterion, and shows that existing evaluation metrics are not robust against even simple perturbations and disagree with scores assigned by humans to the perturbed outputs.
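
For illustration, a toy perturbation template in the spirit described: it degrades fluency only (by jumbling word order) while leaving the content words, and thus adequacy, untouched. The example sentence is invented; a robust metric should penalize the shuffled output.

```python
import random

def perturb_fluency(text: str, seed: int = 0) -> str:
    """Shuffle word order so only fluency is affected, not content."""
    words = text.split()
    random.Random(seed).shuffle(words)
    return " ".join(words)

original = "the cat sat on the mat"
print(perturb_fluency(original))
# A fluency-sensitive metric should score the shuffled output lower
# than the original, while an adequacy-only metric may not.
```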

RoMe: A Robust Metric for Evaluating Natural Language Generation

This paper proposes an automatic evaluation metric, RoMe, which is trained on language features such as semantic similarity combined with tree edit distance and grammatical acceptability, using a self-supervised neural network to assess the overall quality of the generated sentence.
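
A hedged sketch of the general design: per-sentence features (semantic similarity, tree edit distance, grammatical acceptability) are concatenated and fed to a small network that outputs an overall quality score. The feature values and scorer below are placeholders, not RoMe's actual components.

```python
import torch
import torch.nn as nn

# Placeholder feature vector: [semantic similarity, tree edit distance,
# grammatical acceptability], each produced by a dedicated component.
features = torch.tensor([[0.82, 3.0, 0.91]])

# Small scorer mapping the features to an overall quality in [0, 1].
scorer = nn.Sequential(nn.Linear(3, 8), nn.ReLU(), nn.Linear(8, 1), nn.Sigmoid())
print(float(scorer(features)))
```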

Layer or Representation Space: What Makes BERT-based Evaluation Metrics Robust?

It is shown that an embedding-based metric that has the highest correlation with human evaluations on a standard benchmark can have the lowest correlation if the amount of input noise or unknown tokens increases, and that the highest robustness is achieved when using character-level embeddings, instead of token-based embeddings, from the first layer of the pretrained model.
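
As a sketch of using first-layer representations in an embedding-based metric, here is how the hidden states of a specific layer can be pulled from a Hugging Face encoder; the model name is only an example, and a character-level model (e.g. CANINE) would be swapped in for the character-level variant.

```python
import torch
from transformers import AutoModel, AutoTokenizer

name = "distilbert-base-uncased"  # example model; swap in a character-level one
tok = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name, output_hidden_states=True)

batch = tok(["a candidate sentence"], return_tensors="pt")
with torch.no_grad():
    out = model(**batch)
# hidden_states[0] is the embedding layer; hidden_states[1] is the first
# transformer layer, the representation the paper finds most robust.
first_layer = out.hidden_states[1]
print(first_layer.shape)
```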

MoverScore: Text Generation Evaluating with Contextualized Embeddings and Earth Mover Distance

This paper investigates strategies to encode system and reference texts to devise a metric that shows a high correlation with human judgment of text quality and validate the new metric, namely MoverScore, on a number of text generation tasks.
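
A minimal sketch of the core computation, assuming the POT optimal-transport library: solve an Earth Mover's Distance between two sets of (stand-in) contextualized word embeddings under uniform word weights. MoverScore's actual implementation additionally uses IDF weights and n-gram embeddings.

```python
import numpy as np
import ot  # POT: Python Optimal Transport

rng = np.random.default_rng(0)
hyp = rng.normal(size=(5, 8))   # stand-in contextualized embeddings
ref = rng.normal(size=(7, 8))

# Uniform word weights; MoverScore itself uses IDF-based weights.
a = np.full(len(hyp), 1 / len(hyp))
b = np.full(len(ref), 1 / len(ref))

cost = ot.dist(hyp, ref, metric="euclidean")  # pairwise transport costs
emd = ot.emd2(a, b, cost)                     # exact Earth Mover's Distance
print(emd)
```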