(Meta-) Evaluation of Machine Translation

@inproceedings{CallisonBurch2007MetaEO,
  title={(Meta-) Evaluation of Machine Translation},
  author={Chris Callison-Burch and Cameron S. Fordyce and Philipp Koehn and Christof Monz and Josh Schroeder},
  booktitle={WMT@ACL},
  year={2007}
}
This paper evaluates the translation quality of machine translation systems for 8 language pairs: translating French, German, Spanish, and Czech to English and back. We carried out an extensive human evaluation which allowed us not only to rank the different MT systems, but also to perform higher-level analysis of the evaluation process. We measured timing and intra- and inter-annotator agreement for three types of subjective evaluation. We measured the correlation of automatic evaluation… 
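The two quantities at the heart of this meta-evaluation, inter-annotator agreement and the correlation of automatic metrics with human judgments, are easy to illustrate. Below is a minimal Python sketch with invented rankings and scores; the data and the cohens_kappa helper are hypothetical and not the paper's actual analysis code.

from scipy.stats import spearmanr

def cohens_kappa(labels_a, labels_b):
    # Chance-corrected agreement between two annotators over the same items.
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    categories = set(labels_a) | set(labels_b)
    expected = sum((labels_a.count(c) / n) * (labels_b.count(c) / n)
                   for c in categories)
    return (observed - expected) / (1 - expected)

# Two annotators ranking the same five translations (1 = best).
annotator_1 = [1, 2, 2, 3, 1]
annotator_2 = [1, 2, 3, 3, 1]
print("kappa:", cohens_kappa(annotator_1, annotator_2))

# Correlation of an automatic metric's per-system scores with human scores
# (both hypothetical); Spearman's rho compares the two rankings of systems.
human_score  = [4.1, 3.8, 3.5, 3.0, 3.2]
metric_score = [0.31, 0.29, 0.27, 0.25, 0.26]
rho, _ = spearmanr(human_score, metric_score)
print("Spearman rho:", rho)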
Citations

Further Meta-Evaluation of Machine Translation
TLDR
This paper analyzes the translation quality of machine translation systems for 10 language pairs translating between Czech, English, French, German, Hungarian, and Spanish, and uses the human judgments of the systems to analyze automatic evaluation metrics for translation quality.
Evaluation of Machine Translation Metrics for Czech as the Target Language
TLDR
The main goal of this article is to compare metrics with respect to their correlation with human judgments for Czech as the target language and to propose the best ones for evaluating MT systems translating into Czech.
Evaluating Machine Translation Quality Using Short Segments Annotations
TLDR
A manual evaluation method is proposed for machine translation (MT) in which annotators rank only translations of short segments instead of whole sentences, which results in easier and more efficient annotation.
Syntax-Oriented Evaluation Measures for Machine Translation Output
We explored novel automatic evaluation measures for machine translation output oriented to the syntactic structure of the sentence: the Bleu score on the detailed Part-of-Speech (POS) tags as well as…
Deeper Machine Translation and Evaluation for German
TLDR
This paper describes a hybrid Machine Translation system built for translating from English to German in the domain of technical documentation; the system combines three different MT engines through a selection mechanism that uses deep linguistic features within a machine learning process.
Findings of the 2010 Joint Workshop on Statistical Machine Translation and Metrics for Machine Translation
TLDR
A large-scale manual evaluation of 104 machine translation systems and 41 system combination entries was conducted, and the resulting ranking of these systems was used to measure how strongly 26 automatic evaluation metrics correlate with human judgments of translation quality.
Evaluating the morphological competence of Machine Translation Systems
TLDR
A new type of evaluation focused specifically on the morphological competence of a system with respect to various grammatical phenomena is proposed, which uses automatically generated pairs of source sentences, where each pair tests one morphological contrast.
Measures of Machine Translation Quality
TLDR
An annotation experiment is conducted and a manual evaluation method in which annotators rank only translations of short segments instead of whole sentences is proposed, which results in easier and more efficient annotation.
HEVAL: Yet Another Human Evaluation Metric
TLDR
A new human evaluation metric is proposed which addresses issues of inter-annotator agreement and repeatability in machine translation evaluation and provides solid grounds for making sound assumptions about the quality of the text produced by a machine translation system.
Findings of the 2012 Workshop on Statistical Machine Translation
TLDR
A large-scale manual evaluation of 103 machine translation systems submitted by 34 teams was conducted, and the resulting ranking of these systems was used to measure how strongly 12 automatic evaluation metrics correlate with human judgments of translation quality.

References

Showing 1-10 of 57 references
Correlating automated and human assessments of machine translation quality
TLDR
It is suggested that when human evaluators are forced to make decisions without sufficient context or domain expertise, they fall back on strategies that are not unlike determining n-gram precision.
Re-evaluating Machine Translation Results with Paraphrase Support
TLDR
ParaEval is presented, an automatic evaluation framework that uses paraphrases to improve the quality of machine translation evaluations and correlates significantly better than BLEU with human assessment in measurements for both fluency and adequacy.
Bleu: a Method for Automatic Evaluation of Machine Translation
TLDR
This work proposes a method of automatic machine translation evaluation that is quick, inexpensive, and language-independent, that correlates highly with human evaluation, and that has little marginal cost per run.
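BLEU combines modified n-gram precisions (by default up to 4-grams) with a brevity penalty. A minimal sketch using NLTK's implementation on invented sentences (the tokens are purely illustrative):

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference  = [["the", "cat", "sat", "on", "the", "mat"]]  # list of reference token lists
hypothesis =  ["the", "cat", "is", "on", "the", "mat"]

# Uniform weights over 1- to 4-gram precisions; smoothing keeps the score
# non-zero when a short segment has no matching 4-grams.
score = sentence_bleu(reference, hypothesis,
                      smoothing_function=SmoothingFunction().method1)
print("BLEU:", round(score, 3))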
Re-evaluating the Role of Bleu in Machine Translation Research
TLDR
It is shown that an improved Bleu score is neither necessary nor sufficient for achieving an actual improvement in translation quality, and two significant counterexamples to Bleu’s correlation with human judgments of quality are given.
Manual and Automatic Evaluation of Machine Translation between European Languages
We evaluated machine translation performance for six European language pairs that participated in a shared task: translating French, German, and Spanish texts to English and back. Evaluation was done…
English-to-Czech Factored Machine Translation
TLDR
Experimental results demonstrate significant improvement of translation quality in terms of BLEU.
NIST 2005 machine translation evaluation official results
TLDR
The NIST 2005 Machine Translation Evaluation (MT-05) was part of an ongoing series of evaluations of human language translation technology and provided an important contribution to the direction of research efforts and the calibration of technical capabilities.
Getting to Know Moses: Initial Experiments on German-English Factored Translation
TLDR
The paper is based on the idea of using an off-the-shelf parser to supply linguistic information to a factored translation model and compares the results of German-English translation to the shared task baseline system based on word form.
METEOR: An Automatic Metric for MT Evaluation with High Levels of Correlation with Human Judgments
TLDR
The technical details underlying the Meteor metric are recapped; the latest release includes improved metric parameters and extends the metric to support evaluation of MT output in Spanish, French, and German, in addition to English.
Automatic evaluation of machine translation quality using n-gram co-occurrence statistics
TLDR
DARPA commissioned NIST to develop an MT evaluation facility based on the IBM work, which is now available from NIST and serves as the primary evaluation measure for TIDES MT research.
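Unlike BLEU, the NIST score weights each matching n-gram by its information content, so rarer n-grams count for more. A hypothetical sketch using NLTK's implementation (invented sentences, with n limited to 4 for the short example):

from nltk.translate.nist_score import sentence_nist

reference  = [["the", "cat", "sat", "on", "the", "mat"]]
hypothesis =  ["the", "cat", "is", "on", "the", "mat"]

# Matching n-grams are weighted by how informative (infrequent) they are
# in the reference; a modified brevity penalty is applied as well.
print("NIST:", sentence_nist(reference, hypothesis, n=4))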