Corpus ID: 8938789

A Study of Translation Edit Rate with Targeted Human Annotation

@inproceedings{Snover2006ASO,
  title={A Study of Translation Edit Rate with Targeted Human Annotation},
  author={Matthew G. Snover and B. Dorr and R. Schwartz and Linnea Micciulla and John Makhoul},
  booktitle={AMTA},
  year={2006}
}
We examine a new, intuitive measure for evaluating machine-translation output that avoids the knowledge intensiveness of more meaning-based approaches, and the labor-intensiveness of human judgments. Translation Edit Rate (TER) measures the amount of editing that a human would have to perform to change a system output so it exactly matches a reference translation. We show that the single-reference variant of TER correlates as well with human judgments of MT quality as the four-reference variant…
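
TER is computed as the number of edits needed to change the hypothesis into the reference, divided by the average number of reference words (with a single reference, simply edits divided by reference length); insertions, deletions, substitutions, and block shifts each count as one edit. The following is a minimal, illustrative Python sketch of the single-reference case: it counts only insertions, deletions, and substitutions via word-level edit distance and omits the greedy search for shifts used by full TER, so it can over-count edits relative to the real metric. Function names are illustrative, not from the paper's implementation.

# Simplified, single-reference TER: word-level edit distance over reference length.
# Full TER also counts block shifts (found greedily) as single edits; this sketch omits them.

def edit_distance(hyp, ref):
    """Word-level Levenshtein distance between hypothesis and reference token lists."""
    m, n = len(hyp), len(ref)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i                      # delete every remaining hypothesis word
    for j in range(n + 1):
        dp[0][j] = j                      # insert every remaining reference word
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution or match
    return dp[m][n]

def simplified_ter(hypothesis, reference):
    """Edits needed to turn the hypothesis into the reference, per reference word."""
    hyp, ref = hypothesis.split(), reference.split()
    return edit_distance(hyp, ref) / max(len(ref), 1)

# Example: one insertion against a six-word reference gives a score of about 0.17.
print(simplified_ter("the cat sat on mat", "the cat sat on the mat"))

Lower scores are better. HTER, discussed in the paper, applies the same computation against a human-targeted reference.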


Predicting Human-Targeted Translation Edit Rate via Untrained Human Annotators
Soliciting edits from untrained human annotators, via the online service Amazon Mechanical Turk, is explored, and it is shown that the collected data allows prediction of the HTER ranking of documents at a significantly higher level than the ranking obtained using automatic metrics.
Fluency, Adequacy, or HTER? Exploring Different Human Judgments with a Tunable MT Metric
TER-Plus, a new tunable MT metric that extends the Translation Edit Rate metric with tunable parameters and the incorporation of morphology, synonymy, and paraphrases, is explored, demonstrating significant differences between the types of human judgments.
HUME: Human UCCA-Based Evaluation of Machine Translation
A semantics-based evaluation of machine translation quality is argued for, which captures what meaning components are retained in the MT output, thus providing a more fine-grained analysis of translation quality and enabling the construction and tuning of semantics-based MT.
Multi-Hypothesis Machine Translation Evaluation
This paper exploits MT model uncertainty to generate multiple diverse translations and uses these as surrogates for reference translations, obtaining a quantification of translation variability that can either complement existing metric scores or replace references altogether.
Meta-Evaluation of a Diagnostic Quality Metric for Machine Translation
This paper evaluates DELiC4MT, a diagnostic metric that assesses the performance of MT systems on user-defined linguistic phenomena, and observes that this diagnostic metric is capable of accurately reflecting translation quality, can be used reliably with automatic word alignments, and correlates well with automatic metrics and, in general, with human judgements.
Estimating Machine Translation Post-Editing Effort with HTER
Although Machine Translation (MT) has been attracting more and more attention from the translation industry, the quality of current MT systems still requires humans to post-edit translations to…
BLEUÂTRE: flattening syntactic dependencies for MT evaluation
Using a statistical, treebank-trained parser, a novel approach to syntactically-informed evaluation of machine translation (MT) is described, which gains the benefit of syntactic analysis of the reference translations but avoids the need to parse potentially noisy candidate translations.
Predicting Machine Translation Adequacy
This paper proposes a number of indicators contrasting the source and translation texts to predict the adequacy of such translations at the sentence level, and shows that these indicators can yield improvements over previous work using general quality indicators based on source complexity and target fluency.
Semantic vs. Syntactic vs. N-gram Structure for Machine Translation Evaluation
The results strongly indicate that using semantic role labels for MT evaluation can be significantly more effective and better correlated with human judgement on adequacy than BLEU and STM.
KoBE: Knowledge-Based Machine Translation Evaluation
This work proposes a simple and effective method for machine translation evaluation which does not require reference translations, and achieves the highest correlation with human judgements on 9 out of the 18 language pairs from the WMT19 benchmark for evaluation without references.

References

Showing 1-10 of 14 references
Automatic evaluation of machine translation quality using n-gram co-occurrence statistics
NIST was commissioned to develop an MT evaluation facility based on the IBM work; this facility is now available from NIST and serves as the primary evaluation measure for TIDES MT research.
METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments
METEOR, an automatic metric for machine translation evaluation based on a generalized concept of unigram matching between the machine-produced translation and human-produced reference translations, is described; it can be easily extended to include more advanced matching strategies.
Bleu: a Method for Automatic Evaluation of Machine Translation
This work proposes a method of automatic machine translation evaluation that is quick, inexpensive, and language-independent, that correlates highly with human evaluation, and that has little marginal cost per run.
Evaluation of machine translation and its evaluation
The unigram-based F-measure has significantly higher correlation with human judgments than recently proposed alternatives and has an intuitive graphical interpretation, which can facilitate insight into how MT systems might be improved.
A Paraphrase-Based Approach to Machine Translation Evaluation
A novel approach to automatic machine translation evaluation based on paraphrase identification is proposed, which shows that models employing paraphrase-based features correlate better with human judgments than models based purely on existing automatic MT metrics.
Automated Postediting of Documents
This work argues for the construction of postediting modules that are portable across MT systems, as an alternative to hardcoding improvements inside any one system, and builds a complete self-contained postediting module for the task of article selection (a, an, the) for English noun phrases.
An Evaluation Tool for Machine Translation: Fast Evaluation for MT Research
This paper defines evaluation criteria which are more adequate than pure edit distance and describes how measurement along these quality criteria is performed semi-automatically in a fast, convenient, and above all consistent way using this tool and the corresponding graphical user interface.
Evaluating an NLG System using Post-Editing
This work describes how post-edit data is used to evaluate SUMTIME-MOUSAM, an NLG system that produces weather forecasts; the frequency and type of post-edits are a measure of how well the system works.
Evaluating natural language processing systems
Designing customized methods for testing various NLP systems can be costly, so post hoc justification of the evaluation design is often needed.
SUMMAC: a text summarization evaluation
Analysis of feedback forms filled in after each decision indicated that the intelligibility of present-day machine-generated summaries is high, and the evaluation methods used in the SUMMAC evaluation are of interest both to summarization evaluation and to evaluation of other ‘output-related’ NLP technologies, where there may be many potentially acceptable outputs.