The Eval4NLP Shared Task on Explainable Quality Estimation: Overview and Results

@article{Fomicheva2021TheES,
  title={The Eval4NLP Shared Task on Explainable Quality Estimation: Overview and Results},
  author={M. Fomicheva and Piyawat Lertvittayakumjorn and Wei Zhao and Steffen Eger and Yang Gao},
  journal={ArXiv},
  year={2021},
  volume={abs/2110.04392}
}
In this paper, we introduce the Eval4NLP-2021 shared task on explainable quality estimation. Given a source-translation pair, this shared task requires participants not only to provide a sentence-level score indicating the overall quality of the translation, but also to explain this score by identifying the words that negatively impact translation quality. We present the data, annotation guidelines and evaluation setup of the shared task, describe the six participating systems, and analyze the results. To… 
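As a rough illustration of such an evaluation setup, the sketch below scores a hypothetical submission that provides one sentence-level quality prediction per translation plus one importance score per target token. The metric choices here (AUC, Average Precision, Recall@Top-K at the word level; Pearson correlation at the sentence level), the helper names, and the toy data are assumptions made for the example, not necessarily the exact protocol of the shared task.

```python
# Minimal sketch: scoring word-level explanations against binary gold error
# tags and sentence-level predictions against gold quality judgments.
from typing import List
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import average_precision_score, roc_auc_score


def recall_at_topk(gold: List[int], scores: List[float]) -> float:
    """Fraction of gold error tokens recovered among the top-k scored tokens,
    where k is the number of gold error tokens in the sentence."""
    k = int(sum(gold))
    if k == 0:
        return 0.0
    topk = np.argsort(scores)[::-1][:k]
    return sum(gold[i] for i in topk) / k


def evaluate_word_level(gold_tags, pred_scores):
    """Average word-level metrics over all sentences."""
    auc = np.mean([roc_auc_score(g, s) for g, s in zip(gold_tags, pred_scores)
                   if 0 < sum(g) < len(g)])  # AUC is undefined if tags are all 0 or all 1
    ap = np.mean([average_precision_score(g, s) for g, s in zip(gold_tags, pred_scores)])
    rec = np.mean([recall_at_topk(g, s) for g, s in zip(gold_tags, pred_scores)])
    return {"auc": auc, "average_precision": ap, "recall_at_topk": rec}


# Toy example: three translations with per-token gold error tags (1 = error),
# predicted per-token importance scores, and sentence-level quality scores.
gold_tags = [[0, 1, 0, 0], [1, 0, 1], [0, 0, 1, 0, 0]]
pred_scores = [[0.1, 0.9, 0.2, 0.05], [0.7, 0.3, 0.6], [0.2, 0.1, 0.8, 0.3, 0.1]]
gold_sent = [0.8, 0.3, 0.6]     # gold sentence-level quality judgments
pred_sent = [0.75, 0.40, 0.55]  # system sentence-level predictions

print(evaluate_word_level(gold_tags, pred_scores))
print("sentence-level Pearson r:", pearsonr(gold_sent, pred_sent)[0])
```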

Citations

The UMD Submission to the Explainable MT Quality Estimation Shared Task: Combining Explanation Models with Sequence Labeling
TLDR
The UMD approach combines the predictions of a word-level explainer model built on top of a sentence-level QE model with those of a sequence labeler trained on synthetic data, making it well suited to zero-shot settings.
IST-Unbabel 2021 Submission for the Explainable Quality Estimation Shared Task
TLDR
The joint contribution of Instituto Superior Técnico and Unbabel to the Explainable Quality Estimation shared task improves performance by ensembling explanation scores extracted from models trained with different pre-trained transformers, achieving strong results for in-domain and zero-shot language pairs.
Error Identification for Machine Translation with Metric Embedding and Attention
TLDR
This paper proposes a novel QE architecture which tackles both the word-level data scarcity and the interpretability limitations of recent approaches, and combines sentence-level and word-level components jointly pretrained through an attention mechanism based on synthetic data and a set of MT metrics embedded in a common space.
Reference-Free Word- and Sentence-Level Translation Evaluation with Token-Matching Metrics
TLDR
It is shown that unsupervised metrics that are based on token matching can intrinsically provide feature importance scores that correlate well with human word-level error annotations.
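A rough sketch of this idea: score each target token by its best cosine match against the source tokens, so that poorly matched tokens receive high "error" scores. The toy embeddings and the 1-minus-max-similarity scoring rule below are illustrative assumptions, not the exact metric from the paper; an actual reference-free metric would use multilingual contextual embeddings.

```python
# Sketch: per-token importance scores from a token-matching metric.
import numpy as np


def token_error_scores(src_emb: np.ndarray, tgt_emb: np.ndarray) -> np.ndarray:
    """Return one score per target token: 1 - max cosine similarity to any source token."""
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    sim = tgt @ src.T             # (n_tgt, n_src) cosine similarities
    return 1.0 - sim.max(axis=1)  # high score = poorly matched = suspect token


# Toy stand-in embeddings: 3 source tokens and 4 target tokens in a shared 5-dim space.
rng = np.random.default_rng(0)
src_emb = rng.normal(size=(3, 5))
tgt_emb = rng.normal(size=(4, 5))
print(token_error_scores(src_emb, tgt_emb))  # one importance score per target token
```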
Explaining Errors in Machine Translation with Absolute Gradient Ensembles
TLDR
This work compares different explainability techniques and investigates gradient-based and perturbation-based methods by measuring their performance and required computational effort, observing that using absolute word scores boosts the performance of gradient-based explainers significantly.
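A minimal sketch of the underlying idea: take the gradient of a sentence-level quality score with respect to each token embedding, reduce it to one scalar per token (gradient × input here), and compare the signed scores with their absolute values. The tiny model, the mean pooling, and the gradient × input reduction are stand-ins chosen for the example, not the architecture or explainers used in the paper.

```python
# Sketch: gradient-based word-level saliency, signed vs. absolute scores.
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, emb_dim, seq_len = 100, 16, 6

embedding = nn.Embedding(vocab_size, emb_dim)
scorer = nn.Sequential(nn.Linear(emb_dim, 1))  # toy sentence-level QE head

tokens = torch.randint(0, vocab_size, (1, seq_len))
emb = embedding(tokens).detach().requires_grad_(True)  # (1, seq_len, emb_dim)

sentence_score = scorer(emb).mean()  # pooled sentence-level quality score
sentence_score.backward()            # populates emb.grad

signed = (emb.grad * emb).sum(dim=-1).squeeze(0)  # gradient x input, one score per token
absolute = signed.abs()                           # absolute word scores

print("signed  :", signed.tolist())
print("absolute:", absolute.tolist())
```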
Towards Explainable Evaluation Metrics for Natural Language Generation
TLDR
This concept paper identifies key properties and proposes key goals of explainable machine translation evaluation metrics, and provides a vision of future approaches to explainable evaluation metrics and their evaluation.
USCORE: An Effective Approach to Fully Unsupervised Evaluation Metrics for Machine Translation
TLDR
This work develops fully unsupervised evaluation metrics that beat supervised competitors on 4 out of 5 evaluation datasets and induce unsupervised multilingual sentence embeddings from pseudo-parallel data.
Global Explainability of BERT-Based Evaluation Metrics by Disentangling along Linguistic Factors
TLDR
This work uses a simple regression-based global explainability technique to disentangle metric scores along linguistic factors, including semantics, syntax, morphology, and lexical overlap, and shows that the different metrics capture all aspects to some degree, but that they are all substantially sensitive to lexical overlap.
DiscoScore: Evaluating Text Generation with BERT and Discourse Coherence
TLDR
DiscoScore is introduced, a parametrized discourse metric which uses BERT to model discourse coherence from different perspectives, driven by Centering theory, and surpasses BARTScore by over 10 correlation points on average.
Learning to Scaffold: Optimizing Model Explanations for Teaching
TLDR
This work trains models on three natural language processing and computer vision tasks, and finds that students trained with explanations extracted with this framework are able to simulate the teacher more effectively than students trained with explanations produced by previous methods.

References

Showing 1-10 of 66 references
Explainable Quality Estimation: CUNI Eval4NLP Submission
TLDR
This work first builds a word-level quality estimation model, then fine-tunes this model for sentence-level QE, and achieves near state-of-the-art results.
The UMD Submission to the Explainable MT Quality Estimation Shared Task: Combining Explanation Models with Sequence Labeling
TLDR
The UMD approach combines the predictions of a word-level explainer model built on top of a sentence-level QE model with those of a sequence labeler trained on synthetic data, making it well suited to zero-shot settings.
IST-Unbabel 2021 Submission for the Explainable Quality Estimation Shared Task
TLDR
The joint contribution of Instituto Superior Técnico and Unbabel to the Explainable Quality Estimation shared task improves performance by ensembling explanation scores extracted from models trained with different pre-trained transformers, achieving strong results for in-domain and zero-shot language pairs.
Findings of the WMT 2019 Shared Tasks on Quality Estimation
TLDR
The WMT19 shared task on Quality Estimation is reported, the task of predicting the quality of the output of machine translation systems given just the source text and the hypothesis translations, with a novel addition being the evaluation of sentence-level QE against human judgments.
Error Identification for Machine Translation with Metric Embedding and Attention
TLDR
This paper proposes a novel QE architecture which tackles both the word-level data scarcity and the interpretability limitations of recent approaches, and combines sentence-level and word-level components jointly pretrained through an attention mechanism based on synthetic data and a set of MT metrics embedded in a common space.
Translation Error Detection as Rationale Extraction
TLDR
A novel semi-supervised method for word-level QE is introduced and it is proposed to use the QE task as a new benchmark for evaluating the plausibility of feature attribution, i.e. how interpretable model explanations are to humans.
Reference-Free Word- and Sentence-Level Translation Evaluation with Token-Matching Metrics
TLDR
It is shown that unsupervised metrics that are based on token matching can intrinsically provide feature importance scores that correlate well with human word-level error annotations.
Two-Phase Cross-Lingual Language Model Fine-Tuning for Machine Translation Quality Estimation
TLDR
The Bering Lab’s submission to the WMT 2020 Shared Task on Quality Estimation (QE) fine-tunes XLM-RoBERTa, the state-of-the-art cross-lingual language model, with a few additional parameters, for word-level and sentence-level translation quality estimation.
Findings of the WMT 2020 Shared Task on Quality Estimation
TLDR
This edition of the WMT20 shared task on Quality Estimation included new data with open-domain texts, direct assessment annotations, and multiple language pairs: English-German, English-Chinese, Russian-English, Romanian-English, Estonian-English and Nepali-English data for the sentence-level subtasks.
Explaining Errors in Machine Translation with Absolute Gradient Ensembles
TLDR
This work compares different explainability techniques and investigates gradient-based and perturbation-based methods by measuring their performance and required computational effort, observing that using absolute word scores boosts the performance of gradient-based explainers significantly.