Corpus ID: 232307594

BlonD: An Automatic Evaluation Metric for Document-level Machine Translation

@article{Jiang2021BlonDAA,
  title={BlonD: An Automatic Evaluation Metric for Document-level Machine Translation},
  author={Yuchen Jiang and Shuming Ma and Dongdong Zhang and Jian Yang and Haoyang Huang and Ming Zhou},
  journal={ArXiv},
  year={2021},
  volume={abs/2103.11878}
}
Standard automatic metrics (such as BLEU) are problematic for document-level MT evaluation. They can neither distinguish document-level improvements in translation quality from sentence-level ones, nor can they identify the specific discourse phenomena that caused the translation errors. To address these problems, we propose an automatic metric, BlonD, for document-level machine translation evaluation. BlonD takes discourse coherence into consideration by calculating the recall and distance of…
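The abstract is cut off above, but the recall computation it alludes to can be illustrated. The sketch below is not the official BlonD implementation: the checkpoint_recall function and the CHECKPOINTS vocabulary are hypothetical stand-ins for whichever check-point categories the metric actually tracks.

```python
# Illustrative sketch only -- NOT the official BlonD implementation.
# It shows the general shape of a recall-style check over document-level
# "check-points"; the CHECKPOINTS set is a hypothetical stand-in for the
# categories (pronouns, connectives, tags, ...) the metric actually uses.

from collections import Counter

CHECKPOINTS = {"he", "she", "it", "they", "however", "therefore", "moreover"}

def checkpoint_recall(candidate_doc: str, reference_doc: str) -> float:
    """Fraction of the reference's check-point tokens that the
    candidate document recovers (clipped, bag-of-tokens recall)."""
    cand = Counter(t for t in candidate_doc.lower().split() if t in CHECKPOINTS)
    ref = Counter(t for t in reference_doc.lower().split() if t in CHECKPOINTS)
    if not ref:
        return 1.0  # nothing to recover
    matched = sum(min(cand[t], ref[t]) for t in ref)
    return matched / sum(ref.values())

print(checkpoint_recall(
    "However the model failed . It was retrained .",
    "However , the model failed ; therefore it was retrained ."))  # ~0.67
```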

References

Showing 1-10 of 53 references
Evaluating Pronominal Anaphora in Machine Translation: An Evaluation Measure and a Test Suite
An extensive, targeted dataset is contributed that can be used as a test suite for pronoun translation into English, covering multiple source languages and different pronoun errors drawn from real system translations, and an evaluation measure is proposed to differentiate good and bad pronoun translations.
DiscoTK: Using Discourse Structure for Machine Translation Evaluation
We present novel automatic metrics for machine translation evaluation that use discourse structure and convolution kernels to compare the discourse tree of an automatic translation with that of the human reference.
Extending Machine Translation Evaluation Metrics with Lexical Cohesion to Document Level
Experimental results show that incorporating lexical cohesion into sentence-level evaluation metrics can enhance their correlation with human judgements.
Assessing the Accuracy of Discourse Connective Translations: Validation of an Automatic Metric
This paper introduces a reference-based metric focused on a particular class of function words, namely discourse connectives, which are important for text structuring and rather challenging for MT.
When a Good Translation is Wrong in Context: Context-Aware Machine Translation Improves on Deixis, Ellipsis, and Lexical Cohesion
A human study on an English-Russian subtitles dataset identifies deixis, ellipsis, and lexical cohesion as three main sources of inconsistency; the work also introduces a model suited to this scenario and demonstrates major gains over a context-agnostic baseline on new benchmarks without sacrificing performance as measured by BLEU.
Document-Level Automatic MT Evaluation based on Discourse Representations
Preliminary experiments, based on an extension of the metrics of Giménez and Màrquez (2009) operating over discourse representations, aim at widening the scope of current automatic evaluation measures from sentence to document level.
A Large-Scale Test Set for the Evaluation of Context-Aware Pronoun Translation in Neural Machine Translation
This paper presents a test suite of contrastive translations focused specifically on the translation of pronouns and shows that, while gains in BLEU are moderate for context-aware systems, those systems outperform baselines by a large margin in terms of accuracy on the contrastive test set.
Diagnostic Evaluation of Machine Translation Systems Using Automatically Constructed Linguistic Check-Points
A method is presented that automatically extracts check-points from parallel sentences, making it possible to monitor an MT system's handling of important linguistic phenomena and thereby provide diagnostic evaluation.
Validation of an Automatic Metric for the Accuracy of Pronoun Translation (APT)
A reference-based metric for the accuracy of pronoun translation (APT) is defined and assessed; it automatically aligns a candidate and a reference translation using GIZA++ augmented with specific heuristics, and then counts the number of identical or different pronouns.
ROUGE: A Package for Automatic Evaluation of Summaries
Four different ROUGE measures are introduced: ROUGE-N, ROUGE-L, ROUGE-W, and ROUGE-S, all included in the ROUGE summarization evaluation package, together with their evaluations.