A Set of Recommendations for Assessing Human-Machine Parity in Language Translation

Authors: Samuel Läubli, Sheila Castilho, Graham Neubig, Rico Sennrich, Qinlan Shen, Antonio Toral
Journal: J. Artif. Intell. Res.
The quality of machine translation has improved remarkably in recent years, to the degree that it was found to be indistinguishable from professional human translation in a number of empirical investigations. We reassess Hassan et al.'s 2018 investigation into Chinese-to-English news translation, showing that the finding of human–machine parity was owed to weaknesses in the evaluation design, which is currently considered best practice in the field. We show that the professional human…


On “Human Parity” and “Super Human Performance” in Machine Translation Evaluation

This paper reassesses claims of human parity and super-human performance in machine translation, showing that the terms themselves are problematic and that human translation involves much more than what is embedded in automatic systems.

The Suboptimal WMT Test Sets and Their Impact on Human Parity

It is argued that the low quality of the source side of the WMT news-track test set may lead to an overrated human-parity claim; this is the first attempt to question the "source" side of the test set as a potential cause of that overclaim.

Some Translation Studies informed suggestions for further balancing methodologies for machine translation quality evaluation

This article intends to contribute to the current debate on the quality of neural machine translation (NMT) versus (professional) human translation, where recent claims concerning…

A human evaluation of English-Irish statistical and neural machine translation

This paper provides the first human evaluation study of EN-GA MT using professional translators and in-domain (public administration) data for a more accurate depiction of the translation quality available via MT.

HilMeMe: A Human-in-the-Loop Machine Translation Evaluation Metric Looking into Multi-Word Expressions

The design and implementation of a linguistically motivated human-in-the-loop evaluation metric looking into idiomatic and terminological Multi-word Expressions (MWEs) is described.

A Natural Diet: Towards Improving Naturalness of Machine Translation Output

This work proposes a method for training MT systems to achieve a more natural style, i.e. one mirroring the style of text originally written in the target language, and finds that the resulting output is preferred by human experts over the baseline translations.

KoBE: Knowledge-Based Machine Translation Evaluation

This work proposes a simple and effective method for machine translation evaluation which does not require reference translations, and achieves the highest correlation with human judgements on 9 out of the 18 language pairs from the WMT19 benchmark for evaluation without references.

A User Study of the Incremental Learning in NMT

A user study involving professional, experienced post-editors evaluating on-the-fly adaptation of neural machine translation systems shows that adaptive systems were able to learn the correct translations of task-specific terms, resulting in improved user productivity.

Scientific Credibility of Machine Translation Research: A Meta-Evaluation of 769 Papers

This paper presents the first large-scale meta-evaluation of machine translation (MT), conducted on 769 research papers published from 2010 to 2020, and proposes a guideline to encourage better automatic MT evaluation along with a simple meta-evaluation scoring method to assess its credibility.

A Bilingual Parallel Corpus with Discourse Annotations

A large parallel corpus introduced in Jiang et al. (2022) is described, along with an annotated test set designed to probe the ability of machine translation systems to model various discourse phenomena.

Attaining the Unattainable? Reassessing Claims of Human Parity in Neural Machine Translation

We reassess a recent study (Hassan et al., 2018) that claimed that machine translation (MT) has reached human parity for the translation of news from Chinese into English, using pairwise ranking and…

Achieving Human Parity on Automatic Chinese to English News Translation

It is found that Microsoft's latest neural machine translation system has reached a new state-of-the-art, and that the translation quality is at human parity when compared to professional human translations.

Neural Reranking Improves Subjective Quality of Machine Translation: NAIST at WAT2015

A detailed analysis of the quality gains from neural MT reranking finds that the neural models' main contribution lies in improving the grammatical correctness of the output, rather than in better lexical choice of content words.

Neural versus Phrase-Based Machine Translation Quality: a Case Study

A detailed analysis of neural versus phrase-based SMT outputs is performed, leveraging high-quality post-edits produced by professional translators on the IWSLT data, and provides useful insights into which linguistic phenomena are best modeled by neural models.

Post-editese: an Exacerbated Translationese

It is found that post-edited translations (PEs) are simpler, more normalised, and show a higher degree of interference from the source language than human translations (HTs).

Ten Years of WMT Evaluation Campaigns: Lessons Learnt

This paper reports on experiences gained from running the WMT evaluation campaign, the current state of the art in MT evaluation (both human and automatic), and plans for future editions of WMT.

Approaches to Human and Machine Translation Quality Assessment

This chapter provides a critical overview of the established and developing approaches to the definition and measurement of translation quality in human and machine translation workflows across a range of research, educational, and industry scenarios.

Findings of the 2019 Conference on Machine Translation (WMT19)

This paper presents the results of the premier shared task organized alongside the Conference on Machine Translation (WMT) 2019. Participants were asked to build machine translation systems for any…

Error Analysis of Statistical Machine Translation Output

A framework for classification of the errors of a machine translation system is presented and an error analysis of the system used by the RWTH in the first TC-STAR evaluation is carried out.

Findings of the 2018 Conference on Machine Translation (WMT18)

This paper presents the results of the premier shared task organized alongside the Conference on Machine Translation (WMT) 2018. Participants were asked to build machine translation systems for any…