The Eval4NLP Shared Task on Explainable Quality Estimation: Overview and Results

@article{Fomicheva2021TheES,
  title={The Eval4NLP Shared Task on Explainable Quality Estimation: Overview and Results},
  author={M. Fomicheva and Piyawat Lertvittayakumjorn and Wei Zhao and Steffen Eger and Yang Gao},
  journal={ArXiv},
  year={2021},
  volume={abs/2110.04392}
}
In this paper, we introduce the Eval4NLP-2021 shared task on explainable quality estimation. Given a source-translation pair, this shared task requires participants not only to provide a sentence-level score indicating the overall quality of the translation, but also to explain this score by identifying the words that negatively impact translation quality. We present the data, annotation guidelines and evaluation setup of the shared task, describe the six participating systems, and analyze the results. …
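
A minimal sketch of the kind of word-level evaluation this setup implies, assuming explanations are continuous per-token scores compared against binary gold error tags; the metrics shown (AUC, Average Precision, Recall@Top-K) and the toy data are illustrative only, so consult the paper for the official definitions.

# Word-level explanation scoring sketch (assumptions: continuous per-token
# scores, binary gold error tags; recall_at_top_k is a hypothetical helper).
from sklearn.metrics import roc_auc_score, average_precision_score

def recall_at_top_k(scores, gold_tags):
    """Share of gold error tokens ranked among the top-k scored tokens,
    with k set to the number of gold error tokens."""
    k = sum(gold_tags)
    if k == 0:
        return 0.0
    top_k = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sum(gold_tags[i] for i in top_k) / k

scores = [0.1, 0.9, 0.2, 0.7]  # per-token scores (higher = more likely an error)
gold = [0, 1, 0, 1]            # 1 = token annotated as part of a translation error
print(roc_auc_score(gold, scores),
      average_precision_score(gold, scores),
      recall_at_top_k(scores, gold))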

Citations

The UMD Submission to the Explainable MT Quality Estimation Shared Task: Combining Explanation Models with Sequence Labeling
This paper describes the UMD submission to the Explainable Quality Estimation Shared Task at the Eval4NLP 2021 Workshop on “Evaluation & Comparison of NLP Systems”. We participated in the word-level …
IST-Unbabel 2021 Submission for the Explainable Quality Estimation Shared Task
We present the joint contribution of Instituto Superior Técnico (IST) and Unbabel to the Explainable Quality Estimation (QE) shared task, where systems were submitted to two tracks: constrained …
Error Identification for Machine Translation with Metric Embedding and Attention
Quality Estimation (QE) for Machine Translation has been shown to reach relatively high accuracy in predicting sentence-level scores, relying on pretrained contextual embeddings and human-produced …
Reference-Free Word- and Sentence-Level Translation Evaluation with Token-Matching Metrics
Many modern machine translation evaluation metrics like BERTScore, BLEURT, COMET, MonoTransquest or XMoverScore are based on black-box language models. Hence, it is difficult to explain why these …
Explaining Errors in Machine Translation with Absolute Gradient Ensembles
Current research on quality estimation of machine translation focuses on the sentence-level quality of the translations. By using explainability methods, we can use these quality estimations for …
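
For readers unfamiliar with gradient-based explanations, here is a hypothetical sketch of a single absolute-gradient attribution for a sentence-level QE regressor; the cited work ensembles several such explainers, which is not shown. "model" and "tokenizer" stand for any Hugging Face sequence-regression model and its tokenizer, not the authors' system.

# Hypothetical sketch: one absolute-gradient attribution for a sentence-level
# QE regressor ("model"/"tokenizer" are placeholders, not the cited system).
def absolute_gradient_attributions(model, tokenizer, source, translation):
    model.eval()
    enc = tokenizer(source, translation, return_tensors="pt")
    # Embed the tokens explicitly so gradients can be taken w.r.t. the embeddings.
    embeds = model.get_input_embeddings()(enc["input_ids"]).detach().requires_grad_(True)
    score = model(inputs_embeds=embeds,
                  attention_mask=enc["attention_mask"]).logits.squeeze()
    score.backward()
    # One attribution per subword token: summed absolute gradient over hidden dims.
    return embeds.grad.abs().sum(dim=-1).squeeze(0)
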
Global Explainability of BERT-Based Evaluation Metrics by Disentangling along Linguistic Factors
TLDR: This work uses a simple regression-based global explainability technique to disentangle metric scores along linguistic factors, including semantics, syntax, morphology, and lexical overlap, and shows that the different metrics capture all aspects to some degree, but that they are all substantially sensitive to lexical overlap.
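
The gist of that regression-based technique can be sketched as follows, with synthetic per-sentence factor features (semantics, syntax, morphology, lexical overlap) standing in for the real ones; this illustrates the idea only, not the authors' implementation.

# Regression-based global explainability sketch: regress metric scores on
# linguistic-factor features and read off the fitted coefficients.
# Feature names and data are synthetic.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
factors = rng.random((500, 4))  # columns: semantics, syntax, morphology, lexical_overlap
metric_scores = factors @ np.array([0.5, 0.2, 0.1, 0.6]) + rng.normal(0, 0.05, 500)

reg = LinearRegression().fit(factors, metric_scores)
for name, coef in zip(["semantics", "syntax", "morphology", "lexical_overlap"], reg.coef_):
    print(f"{name}: {coef:.2f}")  # a larger coefficient means the metric tracks that factor more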
