Corpus ID: 8938789

A Study of Translation Edit Rate with Targeted Human Annotation

Authors: Matthew G. Snover, B. Dorr, R. Schwartz, Linnea Micciulla, John Makhoul
Published in: Conference of the Association for Machine Translation in the Americas
We examine a new, intuitive measure for evaluating machine-translation output that avoids the knowledge intensiveness of more meaning-based approaches, and the labor-intensiveness of human judgments. Translation Edit Rate (TER) measures the amount of editing that a human would have to perform to change a system output so it exactly matches a reference translation. We show that the single-reference variant of TER correlates as well with human judgments of MT quality as the four-reference variant… 
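The abstract describes TER only informally, as the amount of editing needed to turn a system output into a reference translation. As a rough illustration, the sketch below computes a simplified, single-reference variant: a word-level edit distance (insertions, deletions, substitutions) divided by the number of reference words. Whitespace tokenization, unit edit costs, and the function names `word_edit_distance` and `simple_ter` are assumptions made for this example, and the block-shift operation that TER as usually described also counts as an edit is omitted, so this is not the authors' implementation.

```python
def word_edit_distance(hyp, ref):
    """Word-level Levenshtein distance (insert, delete, substitute; no shifts)."""
    # prev[j] holds the distance between the hypothesis prefix seen so far
    # and the first j reference words.
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(
                prev[j] + 1,         # delete h from the hypothesis
                curr[j - 1] + 1,     # insert r into the hypothesis
                prev[j - 1] + cost,  # keep (cost 0) or substitute (cost 1)
            ))
        prev = curr
    return prev[-1]


def simple_ter(hypothesis, reference):
    """Simplified edit rate: word edits divided by reference length.

    Assumption: a single reference and whitespace tokenization; shifts
    are not modeled, so word reorderings cost more than in full TER.
    """
    hyp = hypothesis.split()
    ref = reference.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return word_edit_distance(hyp, ref) / len(ref)
```

For instance, `simple_ter("the cat sat on mat", "the cat sat on the mat")` needs one insertion against a six-word reference. Because shifts are not modeled, a swapped word pair is charged as two substitutions here.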

Predicting Human-Targeted Translation Edit Rate via Untrained Human Annotators

Soliciting edits from untrained human annotators, via the online service Amazon Mechanical Turk, is explored, and it is shown that the collected data allows HTER-ranking of documents to be predicted at a significantly higher level than the ranking obtained using automatic metrics.

Fluency, Adequacy, or HTER? Exploring Different Human Judgments with a Tunable MT Metric

TER-Plus is explored, which is a new tunable MT metric that extends the Translation Edit Rate evaluation metric with tunable parameters and the incorporation of morphology, synonymy and paraphrases, demonstrating significant differences between the types of human judgments.

HUME: Human UCCA-Based Evaluation of Machine Translation

A semantics-based evaluation of machine translation quality is argued for, which captures what meaning components are retained in the MT output, thus providing a more fine-grained analysis of translation quality, and enabling the construction and tuning of semantics-based MT.

Multi-Hypothesis Machine Translation Evaluation

This paper exploits the MT model uncertainty to generate multiple diverse translations and uses these as surrogates to reference translations to obtain a quantification of translation variability to either complement existing metric scores or replace references altogether.

Meta-Evaluation of a Diagnostic Quality Metric for Machine Translation

This paper evaluates DELiC4MT, a diagnostic metric that assesses the performance of MT systems on user-defined linguistic phenomena and observes that this diagnostic metric is capable of accurately reflecting translation quality, can be used reliably with automatic word alignments and correlates well with automatic metrics and, in general, with human judgements.

BLEUÂTRE: flattening syntactic dependencies for MT evaluation

Using a statistical, treebank-trained parser, a novel approach to syntactically-informed evaluation of machine translation (MT) is described, which gains the benefit of syntactic analysis of the reference translations, but avoids the need to parse potentially noisy candidate translations.

Rethink about the Word-level Quality Estimation for Machine Translation from Human Judgement

The results not only show that the proposed dataset is more consistent with human judgment but also show the effectiveness of the proposed tag correcting strategies.

Predicting Machine Translation Adequacy

This paper proposes a number of indicators contrasting the source and translation texts to predict the adequacy of such translations at the sentence-level, and shows that these indicators can yield improvements over previous work using general quality indicators based on source complexity and target fluency.

Semantic vs. Syntactic vs. N-gram Structure for Machine Translation Evaluation

The results strongly indicate that using semantic role labels for MT evaluation can be significantly more effective and better correlated with human judgement on adequacy than BLEU and STM.

KoBE: Knowledge-Based Machine Translation Evaluation

This work proposes a simple and effective method for machine translation evaluation which does not require reference translations, and achieves the highest correlation with human judgements on 9 out of the 18 language pairs from the WMT19 benchmark for evaluation without references.



Automatic evaluation of machine translation quality using n-gram co-occurrence statistics

DARPA commissioned NIST to develop an MT evaluation facility based on the IBM work, which is now available from NIST and serves as the primary evaluation measure for TIDES MT research.

METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments

METEOR is described, an automatic metric for machine translation evaluation that is based on a generalized concept of unigram matching between the machine-produced translation and human-produced reference translations and can be easily extended to include more advanced matching strategies.

Bleu: a Method for Automatic Evaluation of Machine Translation

This work proposes a method of automatic machine translation evaluation that is quick, inexpensive, and language-independent, that correlates highly with human evaluation, and that has little marginal cost per run.

Evaluation of machine translation and its evaluation

The unigram-based F-measure has significantly higher correlation with human judgments than recently proposed alternatives and has an intuitive graphical interpretation, which can facilitate insight into how MT systems might be improved.

A Paraphrase-Based Approach to Machine Translation Evaluation

A novel approach to automatic machine translation evaluation based on paraphrase identification is proposed, which shows that models employing paraphrase-based features correlate better with human judgments than models based purely on existing automatic MT metrics.

An Evaluation Tool for Machine Translation: Fast Evaluation for MT Research

This paper defines evaluation criteria which are more adequate than pure edit distance and describes how the measurement along these quality criteria is performed semi-automatically in a fast, convenient and above all consistent way using this tool and the corresponding graphical user interface.

Evaluating an NLG System using Post-Editing

This work describes how post-edit data is used to evaluate SUMTIME-MOUSAM, an NLG system that produces weather forecasts, with the frequency and type of post-edits serving as a measure of how well the system works.

SUMMAC: a text summarization evaluation

Analysis of feedback forms filled in after each decision indicated that the intelligibility of present-day machine-generated summaries is high, and the evaluation methods used in the SUMMAC evaluation are of interest to both summarization evaluation as well as evaluation of other ‘output-related’ NLP technologies, where there may be many potentially acceptable outputs.

Edit Distance with Move Operations

Three Heads are Better than One
