Improving Evaluation of Machine Translation Quality Estimation

@inproceedings{Graham2015ImprovingEO,
  title={Improving Evaluation of Machine Translation Quality Estimation},
  author={Yvette Graham},
  booktitle={ACL},
  year={2015}
}
Quality estimation evaluation commonly takes the form of measurement of the error that exists between predictions and gold standard labels for a particular test set of translations. Issues can arise during comparison of quality estimation prediction score distributions and gold label distributions, however. In this paper, we provide an analysis of methods of comparison and identify areas of concern with respect to widely used measures, such as the ability to gain by prediction of aggregate… 
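The concern the abstract alludes to (a system gaining under error-based measures by predicting something close to an aggregate statistic of the labels) can be illustrated with a small synthetic sketch. The numbers and the numpy/scipy code below are illustrative assumptions, not material from the paper:

```python
# Illustrative sketch with made-up data: a near-constant predictor that outputs
# roughly the mean gold score can achieve lower MAE than a genuinely informative
# predictor, while Pearson correlation ranks the two the other way around.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
gold = rng.normal(70, 15, size=500)                      # hypothetical gold labels

informative = gold + rng.normal(0, 20, size=500)         # noisy but correlated predictions
conservative = gold.mean() + rng.normal(0, 1, size=500)  # predicts (almost) the aggregate

for name, pred in [("informative", informative), ("conservative", conservative)]:
    mae = np.mean(np.abs(pred - gold))
    r, _ = pearsonr(pred, gold)
    print(f"{name:12s}  MAE = {mae:5.2f}   Pearson r = {r:5.2f}")
# Expected pattern: the conservative predictor wins on MAE (~12 vs ~16) despite
# r close to 0, whereas the informative predictor wins on correlation (~0.6).
```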
Improving Evaluation of Document-level Machine Translation Quality Estimation
TLDR
The validity of human annotations currently employed in the evaluation of document-level quality estimation for machine translation (MT) is explored, and the degree to which MT system rankings are dependent on weights employed in the construction of the gold standard is demonstrated.
Is all that Glitters in Machine Translation Quality Estimation really Gold?
TLDR
A range of quality estimation systems employing HTER and direct assessment (DA) of translation adequacy as gold labels are evaluated, resulting in a divergence in system rankings, and a proposed employment of DA for future quality estimation evaluations.
Quality In, Quality Out: Learning from Actual Mistakes
TLDR
This paper presents the first attempt to predict the proportion of actual translation errors in a sentence while minimising the need for direct human annotation, using transfer learning to leverage large-scale noisy annotations together with small sets of high-fidelity human-annotated translation errors to train QE models.
Are we Estimating or Guesstimating Translation Quality?
TLDR
It is suggested that although QE models might capture fluency of translated sentences and complexity of source sentences, they cannot model adequacy of translations effectively.
Pushing the Limits of Translation Quality Estimation
TLDR
A new, carefully engineered, neural model is stacked into a rich feature-based word-level quality estimation system and the output of an automatic post-editing system is used as an extra feature, obtaining striking results on WMT16.
Exploring Prediction Uncertainty in Machine Translation Quality Estimation
Machine Translation Quality Estimation is a notoriously difficult task, which lessens its usefulness in real-world translation environments. Such scenarios can be improved if quality predictions are…
Lightly Supervised Quality Estimation
TLDR
This paper proposes a framework for lightly supervised quality estimation by collecting manually annotated scores for a small number of segments in a test corpus or document, and combining them with automatically predicted quality scores for the remaining segments to predict an overall quality estimate.
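As a rough reading of the combination step described above, the sketch below simply averages per-segment scores, preferring a human annotation wherever one exists; the function name, the equal weighting and the scores are assumptions for illustration, not the paper's actual scheme:

```python
# Hypothetical combination scheme: average segment scores, using the human
# annotation for a segment when available and the model prediction otherwise.
from typing import Dict, List

def overall_quality(predicted: List[float], human: Dict[int, float]) -> float:
    """Document/test-set estimate from mixed human and predicted segment scores."""
    combined = [human.get(i, p) for i, p in enumerate(predicted)]
    return sum(combined) / len(combined)

# Model predictions for six segments; a human annotated only segments 1 and 4.
predicted = [0.62, 0.55, 0.70, 0.48, 0.81, 0.59]
human = {1: 0.40, 4: 0.90}
print(overall_quality(predicted, human))   # 0.615
```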
Metrics for Evaluation of Word-level Machine Translation Quality Estimation
TLDR
Various metrics are suggested to replace the F1-score for the “BAD” class, which is currently used as the main metric, and their performance is compared on real system outputs and synthetically generated datasets.
Estimating Post-Editing Effort with Translation Quality Features
TLDR
This thesis investigates using Quality Estimation (QE) to predict post-editing effort at the sentence level by focusing on the impact of features, using two datasets emulating real-world scenarios separated by the availability of post-edited training data.

References

Showing 1-10 of 14 references
Accurate Evaluation of Segment-level Machine Translation Metrics
TLDR
Three segment-level metrics, METEOR, NLEPOR and SENTBLEUMOSES, are found to correlate with human assessment at a level not significantly outperformed by any other metric, both in the individual language pair assessment for Spanish-to-English and in the aggregated set of 9 language pairs.
Estimating the Sentence-Level Quality of Machine Translation Systems
TLDR
Results show that the proposed method allows obtaining good estimates and that identifying a reduced set of relevant features plays an important role in predicting the quality of sentences produced by machine translation systems when reference translations are not available.
Testing for Significance of Increased Correlation with Human Judgment
TLDR
A significance test for comparing correlations of two metrics, along with an open-source implementation of the test, are introduced, which shows that for a high proportion of metrics, there is insufficient evidence to conclude significant improvement over BLEU.
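The test in question is the Williams test for the difference between two dependent correlations; a sketch in its commonly cited form follows. Treat it as an approximation of the idea rather than the paper's code, and prefer the open-source implementation the paper releases:

```python
# Williams test sketch: is metric 1's correlation with human judgment (r12)
# significantly higher than metric 2's (r13), given that the two metrics
# correlate with each other at r23 over the same n segments?
import math
from scipy.stats import t as t_dist

def williams_one_sided_p(r12: float, r13: float, r23: float, n: int) -> float:
    K = 1 - r12**2 - r13**2 - r23**2 + 2 * r12 * r13 * r23
    rbar = (r12 + r13) / 2
    denom = 2 * K * (n - 1) / (n - 3) + rbar**2 * (1 - r23) ** 3
    t_stat = (r12 - r13) * math.sqrt((n - 1) * (1 + r23) / denom)
    return 1 - t_dist.cdf(t_stat, df=n - 3)   # one-sided p-value

# Hypothetical inputs: 2000 judged segments, metric correlations 0.52 and 0.45,
# and an inter-metric correlation of 0.80.
print(williams_one_sided_p(0.52, 0.45, 0.80, 2000))
```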
Limitations of MT Quality Estimation Supervised Systems: The Tails Prediction Problem
TLDR
It is shown that standard supervised QE systems, usually trained to minimize MAE, make serious mistakes at predicting the quality of the sentences in the tails of the quality range.
Fluency, Adequacy, or HTER? Exploring Different Human Judgments with a Tunable MT Metric
TLDR
TER-Plus is explored, which is a new tunable MT metric that extends the Translation Edit Rate evaluation metric with tunable parameters and the incorporation of morphology, synonymy and paraphrases, demonstrating significant differences between the types of human judgments.
Randomized Significance Tests in Machine Translation
TLDR
A large-scale human evaluation of shared task systems for two language pairs, carried out to provide a gold standard for the tests, shows very little difference in accuracy across the three methods of significance testing.
Statistical Significance Tests for Machine Translation Evaluation
If two translation systems differ in performance on a test set, can we trust that this indicates a difference in true system quality? To answer this question, we describe bootstrap resampling…
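The bootstrap idea can be sketched over paired per-segment scores; the actual procedure in the paper resamples test sets and recomputes a corpus-level metric such as BLEU, so the additive per-segment simplification below is an assumption made for brevity:

```python
# Paired bootstrap resampling sketch: resample the test set with replacement many
# times and count how often system A's total score beats system B's on the same
# resampled segments.
import random

def paired_bootstrap(scores_a, scores_b, n_resamples=1000, seed=0):
    """Return the fraction of resampled test sets on which system A wins."""
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]   # segments drawn with replacement
        if sum(scores_a[i] for i in idx) > sum(scores_b[i] for i in idx):
            wins += 1
    return wins / n_resamples

# Hypothetical per-segment scores for two systems on an eight-segment test set.
a = [0.71, 0.64, 0.80, 0.55, 0.62, 0.90, 0.58, 0.66]
b = [0.69, 0.60, 0.75, 0.57, 0.61, 0.88, 0.54, 0.65]
print(paired_bootstrap(a, b))   # a value close to 1.0 indicates A is reliably better
```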
Findings of the 2014 Workshop on Statistical Machine Translation
This paper presents the results of the WMT14 shared tasks, which included a standard news translation task, a separate medical translation task, a task for run-time estimation of machine translation…
Confidence Estimation for Machine Translation
TLDR
A detailed study of confidence estimation for machine translation, using data from the NIST 2003 Chinese-to-English MT evaluation to investigate various methods for determining whether MT output is correct.
Statistical Machine Translation
  • M. Osborne
  • Computer Science
    Encyclopedia of Machine Learning and Data Mining
  • 2017
TLDR
Statistical Machine Translation deals with automatically translating sentences in one human language into another human language (such as English), with models estimated from parallel corpora and also from monolingual corpora (examples of target sentences).