Global Explainability of BERT-Based Evaluation Metrics by Disentangling along Linguistic Factors

  • Marvin Kaster, Wei Zhao, Steffen Eger
Evaluation metrics are a key ingredient for progress of text generation systems. In recent years, several BERT-based evaluation metrics have been proposed (including BERTScore, MoverScore, BLEURT, etc.) which correlate much better with human assessment of text generation quality than BLEU or ROUGE, invented two decades ago. However, little is known about what these metrics, which are based on black-box language model representations, actually capture (it is typically assumed they model semantic…
Towards Explainable Evaluation Metrics for Natural Language Generation
This concept paper identifies key properties and proposes key goals of explainable machine translation evaluation metrics, and provides a vision of future approaches to explainable evaluation metrics and their evaluation.
SBERT studies Meaning Representations: Decomposing Sentence Embeddings into Explainable AMR Meaning Features
This work creates similarity metrics that are highly effective, while also providing an interpretable rationale for their rating, and employs these metrics to induce Semantically Structured Sentence BERT embeddings (SBERT), which are composed of different meaning aspects captured in different sub-spaces.
Pre-trained language models evaluating themselves - A comparative study
This work examines the recently introduced metrics BERTScore, BLEURT, NUBIA, MoverScore, and Mark-Evaluate, finding that no metric showed appropriate behaviour for negation and that none of them was overall sensitive to the other issues mentioned above.
USCORE: An Effective Approach to Fully Unsupervised Evaluation Metrics for Machine Translation
This work develops fully unsupervised evaluation metrics that beat supervised competitors on 4 out of 5 evaluation datasets and induces unsupervised multilingual sentence embeddings from pseudo-parallel data.
Repairing the Cracked Foundation: A Survey of Obstacles in Evaluation Practices for Generated Text
This paper surveys the issues with human and automatic model evaluations and with commonly used datasets in NLG that have been pointed out over the past 20 years, lays out a long-term vision for NLG evaluation, and proposes concrete steps for researchers to improve their evaluation processes.
DiscoScore: Evaluating Text Generation with BERT and Discourse Coherence
DiscoScore is introduced, a parametrized discourse metric which uses BERT to model discourse coherence from different perspectives, driven by Centering theory, and surpasses BARTScore by over 10 correlation points on average.
Multi-Objective Hyperparameter Optimization -- An Overview
  • Florian Karl (Fraunhofer Institut für integrierte Schaltungen), Tobias Pielok (Ludwig-Maximilians-Universität München), Julia Moosbauer (Ludwig-Maximilians-Universität…


BLEURT: Learning Robust Metrics for Text Generation
BLEURT, a learned evaluation metric for English based on BERT, can model human judgment with a few thousand possibly biased training examples, and yields superior results even when the training data is scarce and out-of-distribution.
BERTScore: Evaluating Text Generation with BERT
This work proposes BERTScore, an automatic evaluation metric for text generation that correlates better with human judgments and provides stronger model selection performance than existing metrics.
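At its core, BERTScore greedily matches each token's contextual embedding to its most similar counterpart in the other sentence and averages the resulting cosine similarities into precision, recall, and F1. The following is a minimal numpy sketch of that matching step on toy embedding matrices; `bertscore_f1` is a hypothetical name, and the sketch omits the importance weighting and baseline rescaling of the real metric.

```python
import numpy as np

def bertscore_f1(cand_emb, ref_emb):
    """Toy BERTScore: greedy cosine matching between token embeddings.

    cand_emb, ref_emb: (n_tokens, dim) arrays of contextual token embeddings.
    """
    # Normalize rows so dot products become cosine similarities.
    c = cand_emb / np.linalg.norm(cand_emb, axis=1, keepdims=True)
    r = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
    sim = c @ r.T                       # (n_cand, n_ref) cosine matrix
    precision = sim.max(axis=1).mean()  # each candidate token -> best reference match
    recall = sim.max(axis=0).mean()     # each reference token -> best candidate match
    return 2 * precision * recall / (precision + recall)

# Identical token embeddings give a perfect score.
rng = np.random.default_rng(0)
e = rng.normal(size=(5, 8))
print(round(bertscore_f1(e, e), 6))  # 1.0
```

Because each side takes the max over the other, the score is invariant to token order, which is one reason BERTScore is more forgiving of legitimate paraphrases than n-gram overlap metrics.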
MoverScore: Text Generation Evaluating with Contextualized Embeddings and Earth Mover Distance
This paper investigates strategies to encode system and reference texts to devise a metric that shows a high correlation with human judgment of text quality and validate the new metric, namely MoverScore, on a number of text generation tasks.
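MoverScore measures the Earth Mover Distance between the contextualized token embeddings of the two texts. As a hedged illustration of that idea (not the authors' implementation): for the special case of equal-sized token sets with uniform mass, EMD reduces to a minimum-cost one-to-one assignment, which `scipy.optimize.linear_sum_assignment` solves directly; `toy_mover_distance` is a hypothetical name.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def toy_mover_distance(a, b):
    """Toy word-mover distance for equal-sized token sets with uniform mass.

    a, b: (n, dim) embedding arrays. With uniform weights and equal sizes,
    Earth Mover Distance reduces to an optimal one-to-one assignment.
    """
    # Pairwise Euclidean cost between every token pair.
    cost = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    rows, cols = linear_sum_assignment(cost)  # minimum-cost matching
    return cost[rows, cols].mean()

a = np.array([[0.0, 0.0], [1.0, 0.0]])
b = np.array([[1.0, 0.0], [0.0, 0.0]])  # same points, permuted
print(toy_mover_distance(a, b))  # 0.0
```

Unlike the greedy max-matching above, the optimal transport view penalizes a candidate for moving probability mass far in embedding space, which captures many-to-one mismatches more faithfully.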
BLEU Might Be Guilty but References Are Not Innocent
This paper develops a paraphrasing task for linguists to perform on existing reference translations, which counteracts the bias in those references, reveals that multi-reference BLEU does not improve the correlation for high-quality output, and presents an alternative multi-reference formulation that is more effective.
BERGAMOT-LATTE Submissions for the WMT20 Quality Estimation Shared Task
The authors' black-box QE models tied for the winning submission in four out of seven language pairs in Task 1, thus demonstrating very strong performance, and the glass-box approaches also performed competitively, representing a light-weight alternative to the neural-based models.
BERT-Defense: A Probabilistic Model Based on BERT to Combat Cognitively Inspired Orthographic Adversarial Attacks
This work shows that an untrained iterative approach which combines context-independent character-level information with context-dependent information from BERT’s masked language modeling can perform on par with human crowd-workers from Amazon Mechanical Turk supervised via 3-shot learning.
Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks
Sentence-BERT (SBERT) is presented, a modification of the pretrained BERT network that uses siamese and triplet network structures to derive semantically meaningful sentence embeddings that can be compared using cosine similarity.
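The comparison step SBERT enables is simple: pool token embeddings into one fixed-size sentence vector, then score pairs by cosine similarity. A minimal numpy sketch of that pipeline, with toy matrices standing in for BERT token outputs (`mean_pool` and `cosine` are hypothetical helper names):

```python
import numpy as np

def mean_pool(token_embs):
    """Collapse (n_tokens, dim) token embeddings into one sentence vector."""
    return token_embs.mean(axis=0)

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy token embeddings standing in for BERT outputs; a and b point in
# similar directions, c points elsewhere.
sent_a = np.array([[1.0, 0.0], [0.8, 0.2]])
sent_b = np.array([[0.9, 0.1], [1.0, 0.0]])
sent_c = np.array([[0.0, 1.0], [0.1, 0.9]])

ab = cosine(mean_pool(sent_a), mean_pool(sent_b))
ac = cosine(mean_pool(sent_a), mean_pool(sent_c))
print(ab > ac)  # True: similar sentences score higher
```

Because sentences are embedded independently, pairwise similarity search over n sentences costs n encoder passes rather than the n² cross-encoder passes plain BERT would need.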
SBERT-WK: A Sentence Embedding Method by Dissecting BERT-Based Word Models
  • Bin Wang, C.-C. Jay Kuo
  • Computer Science
    IEEE/ACM Transactions on Audio, Speech, and Language Processing
  • 2020
This work proposes a new sentence embedding method, called SBERT-WK, that dissects BERT-based word models through geometric analysis of the space spanned by the word representations, and achieves state-of-the-art performance.
SUPERT: Towards New Frontiers in Unsupervised Evaluation Metrics for Multi-Document Summarization
This work proposes SUPERT, which rates the quality of a summary by measuring its semantic similarity with a pseudo reference summary, i.e. selected salient sentences from the source documents, using contextualized embeddings and soft token alignment techniques.
Text Processing Like Humans Do: Visually Attacking and Shielding NLP Systems
This work investigates the impact of visual adversarial attacks on current NLP systems on character-, word-, and sentence-level tasks, showing that both neural and non-neural models are, in contrast to humans, extremely sensitive to such attacks, suffering performance decreases of up to 82%.