Bleu: a Method for Automatic Evaluation of Machine Translation

by Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu
Human evaluations of machine translation are extensive but expensive. Human evaluations can take months to finish and involve human labor that cannot be reused. We propose a method of automatic machine translation evaluation that is quick, inexpensive, and language-independent, that correlates highly with human evaluation, and that has little marginal cost per run. We present this method as an automated understudy to skilled human judges, which substitutes for them when there is need for quick…
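The core of the method described above is clipped ("modified") n-gram precision combined with a brevity penalty. The following is a minimal single-reference sketch of that idea; function names, tokenization, and the single-reference simplification are illustrative, not the paper's implementation:

```python
from collections import Counter
import math

def modified_ngram_precision(candidate, reference, n):
    """Clipped n-gram precision: each candidate n-gram's count is
    capped by its count in the reference."""
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    clipped = sum(min(c, ref[g]) for g, c in cand.items())
    total = sum(cand.values())
    return clipped / total if total else 0.0

def bleu(candidate, reference, max_n=4):
    """Geometric mean of clipped 1..max_n-gram precisions, scaled by a
    brevity penalty that punishes candidates shorter than the reference."""
    precisions = [modified_ngram_precision(candidate, reference, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0
    log_mean = sum(math.log(p) for p in precisions) / max_n
    bp = 1.0 if len(candidate) > len(reference) else \
        math.exp(1 - len(reference) / len(candidate))
    return bp * math.exp(log_mean)
```

A perfect match scores 1.0, and the clipping step stops a candidate from gaming unigram precision by repeating a reference word.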
Survey of Machine Translation Evaluation
The evaluation of machine translation (MT) systems is an important and active research area. Many methods have been proposed to determine and optimize the output quality of MT systems. Because of the…
Human Post-editing in Hybrid Machine Translation Systems: Automatic and Manual Analysis and Evaluation
There is evidence that MT can streamline the translation process for specific types of texts, such as questions; however, it does not yet rival the quality of human translation, and post-editing remains key to closing this gap.
Evaluation of machine translation and its evaluation
The unigram-based F-measure has significantly higher correlation with human judgments than recently proposed alternatives and has an intuitive graphical interpretation, which can facilitate insight into how MT systems might be improved.
Correlating automated and human assessments of machine translation quality
It is suggested that when human evaluators are forced to make decisions without sufficient context or domain expertise, they fall back on strategies that are not unlike determining n-gram precision.
A Comparative Study and Analysis of Evaluation Matrices in Machine Translation
This survey discusses the different metrics used in automatic evaluation techniques to assess the output quality of MT systems.
Human and Automatic Evaluation of English to Hindi Machine Translation Systems
This work presents the MT evaluation results of some of the machine translators available online for English-Hindi machine translation, measured on automatic evaluation metrics and human subjectivity measures.
Human Evaluation of Machine Translation Through Binary System Comparisons
It is shown how confidence ranges for state-of-the-art evaluation measures such as WER and TER can be computed accurately and efficiently without having to resort to Monte Carlo estimates.
HEVAL: Yet Another Human Evaluation Metric
A new human evaluation metric is proposed that addresses issues of inter-annotator agreement and repeatability in machine translation evaluation, providing solid grounds for sound judgments about the quality of the text produced by a machine translation system.
(Meta-) Evaluation of Machine Translation
An extensive human evaluation was carried out not only to rank the different MT systems, but also to perform higher-level analysis of the evaluation process, revealing surprising facts about the most commonly used methodologies.
A Quantitative Method for Machine Translation Evaluation
This proposal attempts to measure the percentage of words that must be modified at the output of an automatic translator in order to obtain a correct translation.


Corpus-based comprehensive and diagnostic MT evaluation: initial Arabic, Chinese, French, and Spanish results
Two metrics for automatic evaluation of machine translation quality, BLEU and NEE, are compared to human judgment of quality of translation of Arabic, Chinese, French, and Spanish documents into English.
The ARPA MT Evaluation Methodologies: Evolution, Lessons, and Future Approaches
This paper describes this evolutionary process, along with measurements of the most recent MT evaluation (January 1994) and the evaluation process now underway, which is intended to provide a basis for measuring, and thereby facilitating, the progress of MT systems in the ARPA-sponsored research program.
Additional mt-eval references
  • Technical report, International Standards for Language Engineering, Evaluation Working Group, 2001.
Toward finely differentiated evaluation metrics for machine translation
  • Proceedings of the Eagles Workshop on Standards and Evaluation, Pisa, Italy, 1999.