Corpus ID: 5189165

LEPOR: A Robust Evaluation Metric for Machine Translation with Augmented Factors

Lifeng Han, Derek F. Wong, Lidia S. Chao
In conventional machine translation evaluation metrics, using too little information about the translations often produces unreasonable results with low correlation to human judgments. On the other hand, relying on many external linguistic resources and tools (e.g. part-of-speech tagging, morphemes, stemming, and synonyms) makes the metrics complicated, time-consuming, and not universal, since different languages have different linguistic features. This paper proposes a novel…
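The augmented factors the paper combines can be illustrated with a simplified sketch: a sentence-length penalty, an n-gram position difference penalty, and a weighted harmonic mean of precision and recall. This is a hypothetical reading, not the authors' exact formulation; the function names, the greedy first-occurrence word matching, and the default weights are illustrative assumptions.

```python
# Simplified sketch of a LEPOR-style score (assumed factors, not the
# authors' exact equations): length penalty * position penalty *
# harmonic mean of precision and recall.
from collections import Counter
import math

def length_penalty(ref_len: int, hyp_len: int) -> float:
    """Penalize hypotheses whose length differs from the reference."""
    if hyp_len == ref_len:
        return 1.0
    shorter, longer = sorted((hyp_len, ref_len))
    return math.exp(1 - longer / shorter)

def harmonic_pr(precision: float, recall: float,
                alpha: float = 1.0, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall."""
    if precision == 0.0 or recall == 0.0:
        return 0.0
    return (alpha + beta) / (alpha / recall + beta / precision)

def position_penalty(hyp: list, ref: list) -> float:
    """exp(-NPD): average normalized position difference of matched words
    (greedy first-occurrence matching, an illustrative simplification)."""
    total = 0.0
    for i, w in enumerate(hyp):
        if w in ref:
            j = ref.index(w)
            total += abs(i / len(hyp) - j / len(ref))
    return math.exp(-total / len(hyp))

def lepor(hyp: list, ref: list) -> float:
    matches = sum((Counter(hyp) & Counter(ref)).values())
    p = matches / len(hyp)
    r = matches / len(ref)
    return (length_penalty(len(ref), len(hyp))
            * position_penalty(hyp, ref)
            * harmonic_pr(p, r))
```

An identical hypothesis and reference score 1.0; each factor drops the score multiplicatively as length, word order, or word choice diverges, which is the design idea the abstract describes: several cheap, language-independent factors instead of heavy linguistic resources.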


Language-independent Model for Machine Translation Evaluation with Reinforced Factors

A novel language-independent evaluation metric is proposed in this work, with enhanced factors and optional linguistic information (part-of-speech tags, n-grams), kept lightweight so that the metric performs well across different language pairs.

LEPOR: An Augmented Machine Translation Evaluation Metric

Novel MT evaluation methods are designed in which the weighting of factors can be optimised according to the characteristics of languages; a concise linguistic feature based on POS shows that the methods can yield even higher performance when some external linguistic resources are used.

Unsupervised Quality Estimation Model for English to German Translation and Its Application in Extensive Supervised Evaluation

An unsupervised MT evaluation metric using a universal part-of-speech tagset, without relying on reference translations, is proposed; the designed methods are shown to yield higher correlation scores with human judgments.

Difficulty-Aware Machine Translation Evaluation

A novel difficulty-aware MT evaluation metric is proposed, expanding the evaluation dimension by taking translation difficulty into consideration, and shows that the proposed method outperforms commonly used MT metrics in terms of human correlation.

Adequacy–Fluency Metrics: Evaluating MT in the Continuous Space Model Framework

This work extends and evaluates a two-dimensional automatic evaluation metric for machine translation, designed to operate at the sentence level. The metric is based on the concepts of adequacy and fluency.

A Description of Tunable Machine Translation Evaluation Systems in WMT13 Metrics Task

This paper describes the machine translation evaluation systems used for participation in the WMT13 shared Metrics Task: two automatic MT evaluation systems, nLEPOR_baseline and LEPOR_v3.1.

Automatic Machine Translation Evaluation with Part-of-Speech Information

This paper explores evaluation using only the part-of-speech information of the words: the method is based solely on the agreement of the POS strings of the hypothesis translation and the reference. POS acts as a proxy similar to synonyms, in addition to capturing the syntactic and morphological behaviour of the lexical item in question.

How to evaluate machine translation: A review of automated and human metrics

The most up-to-date and influential automated, semi-automated and human metrics used to evaluate the quality of machine translation (MT) output are presented, providing the necessary background for MT evaluation projects.

A global analysis of metrics used for measuring performance in natural language processing

The results suggest that the large majority of natural language processing metrics currently in use have properties that may result in an inadequate reflection of a model's performance; ambiguities and inconsistencies in the reporting of metrics may lead to difficulties in interpreting and comparing model performances, impairing transparency and reproducibility in NLP research.

Machine Translation Evaluation: A Survey

A state-of-the-art machine translation (MT) evaluation survey covering both manual and automatic evaluation methods, introducing the different classifications of measures from manual to automatic evaluation.

Evaluation without references: IBM1 scores as evaluation metrics

A truly automatic evaluation metric based on IBM-1 lexicon probabilities, which does not need any reference translations, is proposed; the most promising variants are IBM-1 scores calculated on morphemes and POS 4-grams.

Automatic evaluation of machine translation quality using n-gram co-occurrence statistics

NIST was commissioned to develop an MT evaluation facility based on the IBM (BLEU) work; this facility is now available from NIST and serves as the primary evaluation measure for TIDES MT research.
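The core of n-gram co-occurrence scoring is clipped n-gram precision: each hypothesis n-gram is credited at most as many times as it appears in the reference. A minimal sketch (function names are ours; real BLEU/NIST additionally combine several n-gram orders and apply length-based weighting):

```python
# Clipped (modified) n-gram precision, the building block of
# BLEU/NIST-style co-occurrence metrics. A minimal single-reference sketch.
from collections import Counter

def ngrams(tokens: list, n: int) -> list:
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_ngram_precision(hyp: list, ref: list, n: int) -> float:
    hyp_counts = Counter(ngrams(hyp, n))
    ref_counts = Counter(ngrams(ref, n))
    # Counter intersection clips each hyp count by its reference count.
    overlap = sum((hyp_counts & ref_counts).values())
    total = sum(hyp_counts.values())
    return overlap / total if total else 0.0
```

Clipping is what stops a degenerate hypothesis such as "the the the" from scoring perfect unigram precision against a reference containing a single "the".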

METEOR: An Automatic Metric for MT Evaluation with High Levels of Correlation with Human Judgments

The technical details underlying the Meteor metric are recapped; the latest release includes improved metric parameters and extends the metric to support evaluation of MT output in Spanish, French and German, in addition to English.
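Meteor's basic shape, from the original formulation, is a recall-weighted harmonic mean of unigram precision and recall, discounted by a fragmentation penalty over matched "chunks". The sketch below uses greedy exact matching only; full Meteor also aligns stems, synonyms and paraphrases, and later releases tune the weights per language.

```python
# Simplified METEOR-style score: exact-match greedy alignment only
# (real Meteor also matches stems, synonyms and paraphrases).
def meteor_sketch(hyp: list, ref: list) -> float:
    # Greedy alignment: each hyp token takes the first unused
    # occurrence of the same token in the reference.
    used = [False] * len(ref)
    align = []  # (hyp_index, ref_index) pairs, in hypothesis order
    for i, w in enumerate(hyp):
        for j, r in enumerate(ref):
            if not used[j] and r == w:
                used[j] = True
                align.append((i, j))
                break
    m = len(align)
    if m == 0:
        return 0.0
    p, r = m / len(hyp), m / len(ref)
    f_mean = 10 * p * r / (r + 9 * p)  # recall weighted 9:1 over precision
    # Chunks: maximal runs of matches contiguous in both sentences.
    chunks = 1
    for (i1, j1), (i2, j2) in zip(align, align[1:]):
        if i2 != i1 + 1 or j2 != j1 + 1:
            chunks += 1
    penalty = 0.5 * (chunks / m) ** 3
    return f_mean * (1 - penalty)
```

The fragmentation penalty is what rewards correct word order: the same matched words in fewer, longer chunks score higher than the same matches scattered across the hypothesis.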


A Study of Translation Edit Rate with Targeted Human Annotation

A new, intuitive measure for evaluating machine-translation output is examined that avoids the knowledge-intensiveness of more meaning-based approaches and the labor-intensiveness of human judgments; results indicate that HTER correlates with human judgments better than HMETEOR, and that the four-reference variants of TER and HTER correlate with human judgments as well as, or better than, a second human judgment does.
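The intuition behind TER is the number of word-level edits needed to turn the hypothesis into the reference, normalized by reference length. A minimal sketch follows; note that full TER also allows block shifts of phrases at unit cost, which this plain Levenshtein version omits.

```python
# Simplified TER: word-level edit distance / reference length.
# Full TER additionally permits block shifts, omitted here.
def ter_sketch(hyp: list, ref: list) -> float:
    m, n = len(hyp), len(ref)
    # Standard dynamic-programming edit distance over words.
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n] / n
```

Lower is better: 0.0 means an exact match, and one substitution in a four-word reference yields 0.25.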

Automatic Evaluation of Translation Quality for Distant Language Pairs

An automatic evaluation metric based on rank correlation coefficients modified with precision is proposed and meta-evaluation of the NTCIR-7 PATMT JE task data shows that this metric outperforms conventional metrics.
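The idea of a rank-correlation metric modified with precision can be sketched as follows: measure how well the order of matched words in the hypothesis agrees with their order in the reference (a Kendall's-tau-style count of concordant pairs), then scale by unigram precision. This is an illustrative simplification; the published metric's exact normalization and exponent weighting differ.

```python
# Sketch of a rank-correlation metric for word order: fraction of
# concordant matched-word pairs, scaled by unigram precision.
def rank_correlation_score(hyp: list, ref: list) -> float:
    # Greedy exact matching: each hyp word takes the first unused
    # occurrence in the reference, recording its reference position.
    used = [False] * len(ref)
    positions = []
    for w in hyp:
        for j, r in enumerate(ref):
            if not used[j] and r == w:
                used[j] = True
                positions.append(j)
                break
    k = len(positions)
    if k < 2:
        return 0.0
    pairs = k * (k - 1) // 2
    concordant = sum(1 for a in range(k) for b in range(a + 1, k)
                     if positions[a] < positions[b])
    order_score = concordant / pairs   # 1.0 = same order, 0.0 = reversed
    precision = k / len(hyp)
    return order_score * precision
```

Because it scores word order directly rather than through contiguous n-gram overlap, this style of metric is better suited to distant language pairs (such as English-Japanese) where large reorderings are routine.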

Linguistic Features for Automatic Evaluation of Heterogenous MT Systems

Experimental results show that metrics based on deeper linguistic information (syntactic/shallow-semantic) produce more reliable system rankings than metrics based on lexical matching alone, especially when the systems under evaluation are of a different nature.

A Lightweight Evaluation Framework for Machine Translation Reordering

A simple framework for evaluating word order independently of lexical choice, which compares the system's reordering of a source sentence to reference reordering data generated from manually word-aligned translations, and shows that how the alignments are generated can significantly affect the robustness of the evaluation.

Evaluation of Machine Translation Metrics for Czech as the Target Language

The main goal of this article is to compare metrics with respect to their correlation with human judgments for Czech as the target language, and to propose the best ones for evaluating MT systems translating into Czech.

Meteor 1.3: Automatic Metric for Reliable Optimization and Evaluation of Machine Translation Systems

This paper describes Meteor 1.3, the authors' submission to the 2011 EMNLP Workshop on Statistical Machine Translation automatic evaluation metric tasks. New metric features include improved text…