BlonDe: An Automatic Evaluation Metric for Document-level Machine Translation

Yu Jiang, Tianyu Liu, Shuming Ma, Dongdong Zhang, Jian Yang, Haoyang Huang, Rico Sennrich, Ryan Cotterell, Mrinmaya Sachan, M. Zhou
Standard automatic metrics, e.g. BLEU, are not reliable for document-level MT evaluation. They can neither distinguish document-level improvements in translation quality from sentence-level ones, nor identify the discourse phenomena that cause context-agnostic translations. This paper introduces a novel automatic metric BlonDe to widen the scope of automatic MT evaluation from sentence to document level. BlonDe takes discourse coherence into consideration by categorizing discourse-related spans… 

Multilingual Transitivity and Bidirectional Multilingual Agreement for Multilingual Document-level Machine Translation

A novel framework called Multilingual Transitivity (MTrans) is proposed that brings consistent improvements over strong baselines on three document translation tasks, together with a novel method called MKL that forces the output distributions of inputs with the same meaning but in different languages to be consistent with each other.

Embarrassingly Easy Document-Level MT Metrics: How to Convert Any Pretrained Metric Into a Document-Level Metric

The experimental results support the initial hypothesis and show that a simple extension of the metrics permits them to take advantage of context to resolve ambiguities in the reference.
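The abstract describes a simple extension that lets a sentence-level metric use document context. One plausible sketch of such a context-augmentation step, prepending a window of preceding sentences before scoring with the unchanged metric (the function name and window size are illustrative, not the paper's API):

```python
def add_context(sentences, window=2, sep=" "):
    """Hypothetical sketch: prepend up to `window` preceding sentences
    to each sentence, producing document-aware inputs that can be fed
    to an unmodified sentence-level metric."""
    out = []
    for i, s in enumerate(sentences):
        ctx = sentences[max(0, i - window):i]
        out.append(sep.join(ctx + [s]))
    return out
```

The same augmentation would be applied consistently to hypothesis, reference, and (where applicable) source sides so the metric compares like with like.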



Results of the WMT20 Metrics Shared Task

An extensive analysis is presented of the influence of different reference translations on metric reliability, of how well automatic metrics score human translations, and of the major discrepancies between metric and human scores when evaluating MT systems.

Improving the Transformer Translation Model with Document-Level Context

This work extends the Transformer model with a new context encoder to represent document-level context, which is then incorporated into the original encoder and decoder, and introduces a two-step training method to take full advantage of abundant sentence-level parallel corpora and limited document-level parallel corpora.

A Study of Translation Edit Rate with Targeted Human Annotation

A new, intuitive measure for evaluating machine-translation output is examined that avoids the knowledge-intensiveness of more meaning-based approaches and the labor-intensiveness of human judgments. The results indicate that HTER correlates with human judgments better than HMETEOR, and that the four-reference variants of TER and HTER correlate with human judgments as well as, or better than, a second human judgment does.
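TER counts the minimum number of edits (insertions, deletions, substitutions, and block shifts) needed to turn a hypothesis into the reference, normalized by reference length. A simplified sketch that omits the shift operation, so it reduces to word-level edit distance over reference length, might look like:

```python
def simplified_ter(hyp, ref):
    """Simplified TER: word-level edit distance (insert/delete/substitute)
    divided by reference length. Real TER also counts block shifts,
    which this sketch deliberately omits."""
    h, r = hyp.split(), ref.split()
    # standard dynamic-programming edit distance over words
    d = [[0] * (len(r) + 1) for _ in range(len(h) + 1)]
    for i in range(len(h) + 1):
        d[i][0] = i
    for j in range(len(r) + 1):
        d[0][j] = j
    for i in range(1, len(h) + 1):
        for j in range(1, len(r) + 1):
            cost = 0 if h[i - 1] == r[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(h)][len(r)] / len(r)
```

Lower is better: 0.0 means the hypothesis matches the reference exactly, and scores above 1.0 are possible when many edits are needed relative to a short reference.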

METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments

METEOR is described, an automatic metric for machine translation evaluation that is based on a generalized concept of unigram matching between the machine-produced translation and human-produced reference translations and can be easily extended to include more advanced matching strategies.
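At the core of METEOR's scoring is a recall-weighted harmonic mean over unigram matches. A minimal sketch of that F-mean, using the original 9:1 recall weighting and omitting the stemming/synonym matching stages and the fragmentation penalty:

```python
from collections import Counter

def meteor_fmean(hyp, ref):
    """Unigram F-mean with recall weighted 9x over precision, as in the
    original METEOR. This sketch uses exact surface matches only and
    skips the fragmentation penalty."""
    h, r = Counter(hyp.split()), Counter(ref.split())
    matches = sum((h & r).values())  # multiset intersection of unigrams
    if matches == 0:
        return 0.0
    precision = matches / sum(h.values())
    recall = matches / sum(r.values())
    return 10 * precision * recall / (recall + 9 * precision)
```

The heavy weighting of recall reflects the empirical finding that recall against the reference correlates better with human judgments than precision does.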

Bootstrapping Dialog Systems with Word Embeddings

This work investigates the use of word embeddings in a text classification task with little training data and proposes a simple alternative, vector extrema, to replace the usual averaging of a sentence’s vectors.
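Vector extrema replaces the usual averaging of a sentence's word vectors by keeping, for each dimension, the most extreme value across words. A minimal sketch (sign-preserving maximum magnitude per dimension):

```python
import numpy as np

def vector_extrema(word_vectors):
    """For each dimension, keep the value with the largest absolute
    magnitude across the sentence's word vectors, preserving its sign."""
    m = np.asarray(word_vectors, dtype=float)  # shape (num_words, dim)
    idx = np.abs(m).argmax(axis=0)             # extreme word per dimension
    return m[idx, np.arange(m.shape[1])]
```

The intuition is that common words cluster near the origin in embedding space, so averaging washes out the informative content words that extrema-pooling retains.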

A Comparison of the

represents adverbs within higher-order modal logic. This paper will show some of the comparative advantages of the former over the latter. Despite these advantages, the extensional formalisation has


Each track’s goals, data, and evaluation metrics are introduced, and the results of the received submissions are reported.

Corpora for Document-Level Neural Machine Translation

A novel document-level parallel corpus is constructed for Chinese-Portuguese, a non-English-centred and low-resourced language pair, and a commonly cited document-level method is implemented and evaluated on top of the advanced Transformer model with universal settings.

A Set of Recommendations for Assessing Human-Machine Parity in Language Translation

It is shown that the professional human translations contained significantly fewer errors, and that perceived quality in human evaluation depends on the choice of raters, the availability of linguistic context, and the creation of reference translations.