MoverScore: Text Generation Evaluating with Contextualized Embeddings and Earth Mover Distance

@inproceedings{zhao2019moverscore,
  title={MoverScore: Text Generation Evaluating with Contextualized Embeddings and Earth Mover Distance},
  author={Wei Zhao and Maxime Peyrard and Fei Liu and Yang Gao and Christian M. Meyer and Steffen Eger},
  booktitle={Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)},
  year={2019}
}
A robust evaluation metric has a profound impact on the development of text generation systems. A desirable metric compares system output against references based on their semantics rather than surface forms. In this paper we investigate strategies to encode system and reference texts to devise a metric that shows a high correlation with human judgment of text quality. We validate our new metric, namely MoverScore, on a number of text generation tasks including summarization, machine… 
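To give a rough sense of the Earth Mover Distance idea behind MoverScore (this is an illustrative sketch, not the authors' implementation), the transport cost between two bags of word vectors can be lower-bounded by the "relaxed" formulation in which each token's mass flows entirely to its nearest counterpart on the other side. The toy 2-d vectors below stand in for contextualized embeddings:

```python
import numpy as np

def relaxed_emd(sys_vecs, ref_vecs, sys_w=None, ref_w=None):
    """Relaxed Earth Mover's Distance lower bound between two bags of
    vectors: each side's mass moves to its nearest vector on the other
    side; the max of the two one-sided relaxations is the tighter bound."""
    sys_vecs = np.asarray(sys_vecs, dtype=float)
    ref_vecs = np.asarray(ref_vecs, dtype=float)
    # uniform token weights unless given (MoverScore uses idf weights)
    sys_w = np.full(len(sys_vecs), 1 / len(sys_vecs)) if sys_w is None else sys_w
    ref_w = np.full(len(ref_vecs), 1 / len(ref_vecs)) if ref_w is None else ref_w
    # pairwise Euclidean cost matrix via broadcasting
    cost = np.linalg.norm(sys_vecs[:, None, :] - ref_vecs[None, :, :], axis=-1)
    lb_sys = (sys_w * cost.min(axis=1)).sum()  # system tokens -> nearest ref
    lb_ref = (ref_w * cost.min(axis=0)).sum()  # ref tokens -> nearest system
    return max(lb_sys, lb_ref)

# identical bags of vectors have zero transport cost
a = [[1.0, 0.0], [0.0, 1.0]]
print(relaxed_emd(a, a))  # 0.0
```

The exact EMD requires solving a small linear program over the full flow matrix; the relaxation above drops the marginal constraint on one side at a time, which is why it only bounds the true distance from below.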


Improving Text Generation Evaluation with Batch Centering and Tempered Word Mover Distance
Two techniques for improving encoding representations for similarity metrics are presented: a batch-mean centering strategy that improves statistical properties; and a computationally efficient tempered Word Mover Distance, for better fusion of the information in the contextualized word representations.
MaskEval: Weighted MLM-Based Evaluation for Text Summarization and Simplification
This work introduces MaskEval, a reference-less metric for text summarization and simplification that can be adapted to different quality dimensions: it performs masked language modeling on the concatenation of the candidate and source texts, with an attention-like weighting mechanism that modulates the relative importance of each MLM step.
BLEURT: Learning Robust Metrics for Text Generation
BLEURT, a learned evaluation metric for English based on BERT, can model human judgment with a few thousand possibly biased training examples and yields superior results even when the training data is scarce and out-of-distribution.
CTRLEval: An Unsupervised Reference-Free Metric for Evaluating Controlled Text Generation
Experimental results show that the proposed unsupervised reference-free metric, CTRLEval, has higher correlations with human judgments than other baselines, while obtaining better generalization of evaluating generated texts from different models and with different qualities.
BARTScore: Evaluating Generated Text as Text Generation
This work conceptualizes the evaluation of generated text as a text generation problem, modeled using pre-trained sequence-to-sequence models, and proposes a metric, BARTScore, with a number of variants that can be flexibly applied in an unsupervised fashion to the evaluation of text from different perspectives.
A large-scale computational study of content preservation measures for text style transfer and paraphrase generation
The Mutual Implication Score (MIS) is introduced, a measure that uses the idea of paraphrasing as a bidirectional entailment and outperforms all other measures on the paraphrase detection task and performs on par with the best measures in the text style transfer task.
Evaluating Natural Language Generation via Unbalanced Optimal Transport
Inspired by optimal transport theory, this paper proves that these metrics correspond to optimal transport problems with different hard marginal constraints, and proposes a family of new evaluation metrics, namely Lazy Earth Mover's Distances, based on the more general unbalanced optimal transport problem.
COMET: A Neural Framework for MT Evaluation
This framework leverages recent breakthroughs in cross-lingual pretrained language modeling resulting in highly multilingual and adaptable MT evaluation models that exploit information from both the source input and a target-language reference translation in order to more accurately predict MT quality.
Toward Interpretable Semantic Textual Similarity via Optimal Transport-based Contrastive Sentence Learning
This work explicitly describes the sentence distance as the weighted sum of contextualized token distances on the basis of a transportation problem, and presents the optimal transport-based distance measure, named RCMD; it identifies and leverages semantically-aligned token pairs and enhances the quality of sentence similarity and their interpretation.
Sentence Pair Embeddings Based Evaluation Metric for Abstractive and Extractive Summarization
This work proposes a novel evaluation metric for text generation, Sentence Pair Embeddings (SPEED), which is based on semantic similarity between sentence pairs rather than earlier token-level approaches, and achieves state-of-the-art results on the SummEval dataset, demonstrating the effectiveness of this approach.


Sentence Mover’s Similarity: Automatic Evaluation for Multi-Sentence Texts
This work introduces methods based on sentence mover’s similarity, and finds that sentence-based metrics correlate with human judgments significantly better than ROUGE, both on machine-generated summaries and human-authored essays.
From Word Embeddings To Document Distances
It is demonstrated on eight real world document classification data sets, in comparison with seven state-of-the-art baselines, that the Word Mover's Distance metric leads to unprecedented low k-nearest neighbor document classification error rates.
RUSE: Regressor Using Sentence Embeddings for Automatic Machine Translation Evaluation
The RUSE metric is introduced for the WMT18 metrics shared task and a multi-layer perceptron regressor is used based on three types of sentence embeddings to improve the automatic evaluation of machine translation.
Get To The Point: Summarization with Pointer-Generator Networks
A novel architecture that augments the standard sequence-to-sequence attentional model in two orthogonal ways, using a hybrid pointer-generator network that can copy words from the source text via pointing, which aids accurate reproduction of information, while retaining the ability to produce novel words through the generator.
BERTScore: Evaluating Text Generation with BERT
This work proposes BERTScore, an automatic evaluation metric for text generation that correlates better with human judgments and provides stronger model selection performance than existing metrics.
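In outline (a sketch under simplifying assumptions, not the reference implementation), BERTScore greedily matches each token embedding to its most similar counterpart by cosine similarity and combines the two matching directions into an F1 score:

```python
import numpy as np

def bertscore_f1(cand_vecs, ref_vecs):
    """Greedy-matching F1 over token embeddings (BERTScore-style sketch,
    without the idf weighting or baseline rescaling of the full metric)."""
    c = np.asarray(cand_vecs, dtype=float)
    r = np.asarray(ref_vecs, dtype=float)
    # L2-normalize rows so the dot product is cosine similarity
    c = c / np.linalg.norm(c, axis=1, keepdims=True)
    r = r / np.linalg.norm(r, axis=1, keepdims=True)
    sim = c @ r.T                        # pairwise cosine similarity matrix
    precision = sim.max(axis=1).mean()   # candidate token -> best ref token
    recall = sim.max(axis=0).mean()      # ref token -> best candidate token
    return 2 * precision * recall / (precision + recall)

# identical token embeddings give a perfect score
v = [[1.0, 0.0], [0.0, 1.0]]
print(bertscore_f1(v, v))  # 1.0
```

The greedy max-matching is what distinguishes BERTScore from the flow-based optimal-transport formulation of MoverScore: each token picks its single best match independently, rather than distributing mass under marginal constraints.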
Better Summarization Evaluation with Word Embeddings for ROUGE
This proposal overcomes ROUGE's weakness in evaluating abstractive summarization, i.e., summaries with substantial paraphrasing, by using word embeddings to compute the semantic similarity of the words used in summaries rather than relying on exact lexical overlap.
Meteor++: Incorporating Copy Knowledge into Machine Translation Evaluation
A simple statistical method for copy knowledge extraction is introduced and incorporated into the Meteor metric, resulting in a new machine translation metric, Meteor++, which integrates copy knowledge and improves performance significantly on the WMT17 and WMT15 evaluation sets.
SPICE: Semantic Propositional Image Caption Evaluation
There is considerable interest in the task of automatically generating image captions. However, evaluation is challenging. Existing automatic evaluation metrics are primarily sensitive to n-gram
Why We Need New Evaluation Metrics for NLG
A wide range of metrics are investigated, including state-of-the-art word-based and novel grammar-based ones, and it is demonstrated that they only weakly reflect human judgements of system outputs as generated by data-driven, end-to-end NLG.
MEANT 2.0: Accurate semantic MT evaluation for any output language
We describe a new version of MEANT, which participated in the metrics task of the Second Conference on Machine Translation (WMT 2017). MEANT 2.0 uses idf-weighted distributional n-gram accuracy to