Corpus ID: 127986044

BERTScore: Evaluating Text Generation with BERT

@article{Zhang2020BERTScoreET,
  title={BERTScore: Evaluating Text Generation with BERT},
  author={Tianyi Zhang and Varsha Kishore and Felix Wu and Kilian Q. Weinberger and Yoav Artzi},
  journal={ArXiv},
  year={2020},
  volume={abs/1904.09675}
}
We propose BERTScore, an automatic evaluation metric for text generation. Analogously to common metrics, BERTScore computes a similarity score for each token in the candidate sentence with each token in the reference sentence. However, instead of exact matches, we compute token similarity using contextual embeddings. We evaluate using the outputs of 363 machine translation and image captioning systems. BERTScore correlates better with human judgments and provides stronger model selection performance than existing metrics. Finally, we use an adversarial paraphrase detection task to show that BERTScore is more robust to challenging examples when compared to existing metrics.
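The matching step the abstract describes can be sketched in a few lines. Below is a minimal illustration using the Hugging Face transformers library; it is not the authors' official bert-score package, which additionally applies IDF importance weighting, baseline rescaling, and per-layer selection, and the model choice and helper names here are our own.

```python
# Minimal sketch of BERTScore's core matching step: pairwise cosine
# similarity between contextual token embeddings, reduced by greedy
# max-matching. The official `bert-score` package adds IDF weighting,
# baseline rescaling, and layer selection, all omitted here.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embed(sentence: str) -> torch.Tensor:
    """L2-normalized contextual embeddings, one row per WordPiece token."""
    inputs = tokenizer(sentence, return_tensors="pt", add_special_tokens=False)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]   # (tokens, dim)
    return torch.nn.functional.normalize(hidden, dim=-1)

def bertscore_f1(candidate: str, reference: str) -> float:
    cand, ref = embed(candidate), embed(reference)
    sim = cand @ ref.T                         # cosine similarity matrix
    precision = sim.max(dim=1).values.mean()   # best reference match per candidate token
    recall = sim.max(dim=0).values.mean()      # best candidate match per reference token
    return (2 * precision * recall / (precision + recall)).item()

print(bertscore_f1("the cat sat on the mat", "a cat was sitting on the mat"))
```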

Citations

A Fine-Grained Analysis of BERTScore
TLDR
It is found that while BERTScore can detect when a candidate differs from a reference in important content words, it is less sensitive to smaller errors, especially if the candidate is lexically or stylistically similar to the reference.
BERTTune: Fine-Tuning Neural Machine Translation with BERTScore
TLDR
This paper proposes fine-tuning the models with a novel training objective based on the recently proposed BERTScore evaluation metric, and introduces three approaches for generating soft predictions, allowing the network to remain completely differentiable end-to-end.
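As a rough illustration of what a "soft prediction" can look like, the hypothetical sketch below replaces the non-differentiable argmax over the vocabulary with a probability-weighted average of token embeddings; the paper proposes three variants, and this shows only an expectation-style one, with all tensor names and sizes our own.

```python
# Hypothetical sketch of one "soft prediction" scheme: instead of a hard
# argmax over the vocabulary (non-differentiable), feed the evaluation
# model the probability-weighted average of token embeddings, which keeps
# gradients flowing from the metric back into the generator.
import torch

vocab_size, dim = 30522, 768
embedding_matrix = torch.randn(vocab_size, dim)             # stand-in for BERT's embeddings
logits = torch.randn(1, 7, vocab_size, requires_grad=True)  # decoder outputs, 7 steps

probs = torch.softmax(logits, dim=-1)          # (1, 7, vocab)
soft_embeddings = probs @ embedding_matrix     # expected embedding per step

loss = -soft_embeddings.sum()                  # placeholder for a BERTScore-based objective
loss.backward()
print(logits.grad.shape)                       # gradients reach the decoder logits
```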
Global Explainability of BERT-Based Evaluation Metrics by Disentangling along Linguistic Factors
TLDR
This work uses a simple regression-based global explainability technique to disentangle metric scores along linguistic factors, including semantics, syntax, morphology, and lexical overlap, and shows that the different metrics capture all aspects to some degree but are all substantially sensitive to lexical overlap.
Rewarding Semantic Similarity under Optimized Alignments for AMR-to-Text Generation
TLDR
This work proposes metrics that replace the greedy alignments in BERTScore with optimized ones, and finds that the resulting model enjoys stable training relative to a non-RL setting on AMR-to-text generation.
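One plausible reading of "optimized alignments" is a one-to-one assignment that maximizes total similarity, instead of BERTScore's independent per-token maxima; the paper's exact formulation may differ. A sketch using SciPy's Hungarian-algorithm solver:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def aligned_score(sim: np.ndarray) -> float:
    """sim[i, j]: cosine similarity between candidate token i and reference token j."""
    rows, cols = linear_sum_assignment(-sim)   # negate to maximize total similarity
    return float(sim[rows, cols].mean())

sim = np.array([[0.9, 0.1], [0.2, 0.8], [0.7, 0.6]])
print(aligned_score(sim))   # average similarity under the optimal 1:1 alignment
```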
ViLBERTScore: Evaluating Image Caption Using Vision-and-Language BERT
TLDR
Experimental results on three benchmark datasets show that the proposed evaluation metric correlates significantly better with human judgments than all existing metrics.
Improving Text Generation Evaluation with Batch Centering and Tempered Word Mover Distance
TLDR
Two techniques for improving encoding representations for similarity metrics are presented: a batch-mean centering strategy that improves statistical properties; and a computationally efficient tempered Word Mover Distance, for better fusion of the information in the contextualized word representations.
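The batch-mean centering idea is easy to state in code: subtract the mean vector of the whole batch so similarities are computed on deviations from the common direction shared by contextual embeddings. A minimal sketch, with illustrative tensor shapes:

```python
import torch

def batch_center(embeddings: torch.Tensor) -> torch.Tensor:
    """embeddings: (batch, tokens, dim) contextual word vectors."""
    mean = embeddings.mean(dim=(0, 1), keepdim=True)   # one batch-wide mean vector
    return embeddings - mean

batch = torch.randn(8, 12, 768)                # 8 sentences, 12 tokens, 768-dim vectors
centered = batch_center(batch)
print(centered.mean(dim=(0, 1)).abs().max())   # ~0 after centering
```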
A new approach to calculating BERTScore for automatic assessment of translation quality
TLDR
To improve the token-matching process, it is proposed to combine all incomplete WordPiece tokens into meaningful words, use simple averaging of the corresponding vectors, and calculate BERTScore based on anchor tokens only.
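The merging step is straightforward to sketch, since BERT marks continuation WordPieces with a "##" prefix: group consecutive pieces and average their vectors. The helper below is our own illustration, not the paper's code:

```python
# Merge incomplete WordPiece tokens back into words, averaging the vectors
# of the pieces that make up each word.
import torch

def merge_wordpieces(tokens, vectors):
    """tokens: list of WordPiece strings; vectors: (len(tokens), dim) tensor."""
    words, word_vecs, current = [], [], []
    for tok, vec in zip(tokens, vectors):
        if tok.startswith("##") and current:   # continuation piece: extend last word
            words[-1] += tok[2:]
            current.append(vec)
        else:                                  # start of a new word: flush the old one
            if current:
                word_vecs.append(torch.stack(current).mean(dim=0))
            words.append(tok)
            current = [vec]
    if current:
        word_vecs.append(torch.stack(current).mean(dim=0))
    return words, torch.stack(word_vecs)

words, vecs = merge_wordpieces(["evaluation", "metric", "##s"], torch.randn(3, 768))
print(words)   # ['evaluation', 'metrics']
```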
CTRLEval: An Unsupervised Reference-Free Metric for Evaluating Controlled Text Generation
TLDR
Experimental results show that the proposed unsupervised reference-free metric, CTRLEval, has higher correlations with human judgments than other baselines, while generalizing better when evaluating texts generated by different models and of different quality.
MoverScore: Text Generation Evaluating with Contextualized Embeddings and Earth Mover Distance
TLDR
This paper investigates strategies to encode system and reference texts to devise a metric that shows a high correlation with human judgments of text quality, and validates the new metric, namely MoverScore, on a number of text generation tasks.
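The Earth Mover (Wasserstein) distance at the core of MoverScore can be demonstrated with the POT optimal-transport library; the actual metric also uses IDF-weighted masses and n-gram embeddings, which this toy version omits:

```python
import numpy as np
import ot  # pip install POT

def emd_score(cand_vecs: np.ndarray, ref_vecs: np.ndarray) -> float:
    n, m = len(cand_vecs), len(ref_vecs)
    a = np.full(n, 1.0 / n)                # uniform mass on candidate tokens
    b = np.full(m, 1.0 / m)                # uniform mass on reference tokens
    M = ot.dist(cand_vecs, ref_vecs, metric="cosine")  # pairwise cost matrix
    return ot.emd2(a, b, M)                # minimal transport cost

print(emd_score(np.random.rand(5, 768), np.random.rand(7, 768)))
```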
Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks
TLDR
Sentence-BERT (SBERT) is presented: a modification of the pretrained BERT network that uses siamese and triplet network structures to derive semantically meaningful sentence embeddings that can be compared using cosine similarity.
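Sentence-BERT later shipped as the sentence-transformers library, so the comparison it describes takes only a few lines in practice; the checkpoint name below is one commonly available model, chosen purely for illustration:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
sentences = ["A man is eating food.", "Someone is having a meal."]
embeddings = model.encode(sentences, convert_to_tensor=True)
print(util.cos_sim(embeddings[0], embeddings[1]))  # semantic similarity score
```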

References

Showing 1-10 of 102 references
deltaBLEU: A Discriminative Metric for Generation Tasks with Intrinsically Diverse Targets
TLDR
In tasks involving generation of conversational responses, ΔBLEU correlates reasonably with human judgments and outperforms sentence-level and IBM BLEU in terms of both Spearman's ρ and Kendall's τ.
MoverScore: Text Generation Evaluating with Contextualized Embeddings and Earth Mover Distance
TLDR
This paper investigates strategies to encode system and reference texts to devise a metric that shows a high correlation with human judgments of text quality, and validates the new metric, namely MoverScore, on a number of text generation tasks.
RUSE: Regressor Using Sentence Embeddings for Automatic Machine Translation Evaluation
TLDR
The RUSE metric is introduced for the WMT18 metrics shared task and a multi-layer perceptron regressor is used based on three types of sentence embeddings to improve the automatic evaluation of machine translation.
Putting Evaluation in Context: Contextual Embeddings Improve Machine Translation Evaluation
TLDR
A simple unsupervised metric is proposed, along with additional supervised metrics that rely on contextual word embeddings to encode the translation and reference sentences; these models rival or surpass all existing metrics in the WMT 2017 sentence-level and system-level tracks.
Simple Applications of BERT for Ad Hoc Document Retrieval
TLDR
This work addresses the challenge posed by documents that are typically longer than the length of input BERT was designed to handle by applying inference on sentences individually, and then aggregating sentence scores to produce document scores.
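Generically, that strategy is split, score, aggregate. The sketch below uses a stand-in scorer and a simple mean of the top sentence scores; the paper's actual BERT-based scorer and aggregation weights differ:

```python
# Hypothetical sketch: score a long document with a model whose input window
# only fits a sentence, by scoring sentences independently and aggregating.
# `score_sentence` is a stand-in for an actual BERT-based relevance model.
from typing import Callable, List

def score_document(
    query: str,
    document: str,
    score_sentence: Callable[[str, str], float],
    top_k: int = 3,
) -> float:
    sentences: List[str] = [s.strip() for s in document.split(".") if s.strip()]
    scores = sorted((score_sentence(query, s) for s in sentences), reverse=True)
    return sum(scores[:top_k]) / min(top_k, len(scores))  # mean of best sentences

# Toy scorer based on word overlap, standing in for BERT inference.
toy = lambda q, s: len(set(q.split()) & set(s.split())) / (len(s.split()) or 1)
doc = "BERT handles short inputs. Long documents exceed its window. Split them."
print(score_document("long documents BERT", doc, toy))
```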
Learning to Evaluate Image Captioning
TLDR
This work proposes a novel learning based discriminative evaluation metric that is directly trained to distinguish between human and machine-generated captions and proposes a data augmentation scheme to explicitly incorporate pathological transformations as negative examples during training.
Sentence Mover’s Similarity: Automatic Evaluation for Multi-Sentence Texts
TLDR
This work introduces methods based on sentence mover’s similarity, and finds that sentence-based metrics correlate with human judgments significantly better than ROUGE, both on machine-generated summaries and human-authored essays.
Fine-tune BERT for Extractive Summarization
TLDR
BERTSUM, a simple variant of BERT for extractive summarization, is described; it is the state of the art on the CNN/DailyMail dataset, outperforming the previous best-performing system by 1.65 on ROUGE-L.
SPICE: Semantic Propositional Image Caption Evaluation
There is considerable interest in the task of automatically generating image captions. However, evaluation is challenging. Existing automatic evaluation metrics are primarily sensitive to n-gram overlap, which is neither necessary nor sufficient for the task of simulating human judgment.
Results of the WMT18 Metrics Shared Task: Both characters and embeddings achieve good performance
This paper presents the results of the WMT18 Metrics Shared Task. We asked participants of this task to score the outputs of the MT systems involved in the WMT18 News Translation Task with automatic metrics.