A Call for Clarity in Reporting BLEU Scores

@inproceedings{Post2018ACF,
  title={A Call for Clarity in Reporting BLEU Scores},
  author={Matt Post},
  booktitle={WMT},
  year={2018}
}
  • Matt Post
  • Published in WMT, 23 April 2018
  • Computer Science
The field of machine translation faces an under-recognized problem because of inconsistency in the reporting of scores from its dominant metric. Although people refer to “the” BLEU score, BLEU is in fact a parameterized metric whose values can vary wildly with changes to these parameters. These parameters are often not reported or are hard to find, and consequently, BLEU scores between papers cannot be directly compared. I quantify this variation, finding differences as high as 1.8 between… 
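This sensitivity is easy to reproduce with sacrebleu, the tool released with this paper. The minimal sketch below (assuming the sacrebleu 2.x API; the example sentences are invented) scores the same hypothesis/reference pair under three tokenization settings, and the scores generally differ even though the system output is unchanged.

# Minimal sketch: one hypothesis/reference pair, three tokenization settings.
from sacrebleu.metrics import BLEU

hypotheses = ["The cat doesn't sit on the mat."]
references = [["The cat does not sit on the mat."]]  # one reference stream

for tok in ("13a", "intl", "none"):
    bleu = BLEU(tokenize=tok)
    score = bleu.corpus_score(hypotheses, references)
    # get_signature() records every parameter that affected the score
    print(f"{score}  [{bleu.get_signature()}]")

The signature printed alongside each score is the paper's proposed remedy: report it, and the number becomes comparable across papers.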

Citations

Tangled up in BLEU: Reevaluating the Evaluation of Automatic Machine Translation Evaluation Metrics
TLDR
This work develops a method for thresholding performance improvements under an automatic metric against human judgments, which allows quantifying the Type I versus Type II errors incurred, and suggests improvements to the protocols for both metric evaluation and system performance evaluation in machine translation.
A Large-Scale Test Set for the Evaluation of Context-Aware Pronoun Translation in Neural Machine Translation
TLDR
This paper presents a test suite of contrastive translations focused specifically on the translation of pronouns, and shows that while the context-aware systems' gains in BLEU are moderate, they outperform baselines by a large margin in terms of accuracy on the contrastive test set.
Lost in Machine Translation: A Method to Reduce Meaning Loss
TLDR
A method is presented for defining a less ambiguous translation system in terms of an underlying pre-trained neural sequence-to-sequence model; by increasing injectivity, it achieves greater preservation of meaning, as measured by improved cycle-consistency, without impeding translation quality.
Query-Key Normalization for Transformers
TLDR
QKNorm is proposed, a normalization technique that modifies the attention mechanism to make the softmax function less prone to arbitrary saturation without sacrificing expressivity.
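For intuition, here is a hedged NumPy sketch of the idea as summarized: L2-normalize queries and keys so the attention logits become bounded cosine similarities, then scale by a learned parameter g in place of the usual 1/sqrt(d) factor. The function name, shapes, and single-head setting are illustrative simplifications, not the paper's exact formulation.

import numpy as np

def qk_norm_attention(Q, K, V, g):
    """Single-head attention with L2-normalized queries and keys.
    Q, K, V: arrays of shape (seq_len, d_head); g: learned scalar scale."""
    Qn = Q / np.linalg.norm(Q, axis=-1, keepdims=True)   # unit-norm queries
    Kn = K / np.linalg.norm(K, axis=-1, keepdims=True)   # unit-norm keys
    logits = g * (Qn @ Kn.T)                             # bounded cosine scores
    logits -= logits.max(axis=-1, keepdims=True)         # stable softmax
    weights = np.exp(logits)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(5, 16)) for _ in range(3))
print(qk_norm_attention(Q, K, V, g=10.0).shape)          # (5, 16)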
Explicit Representation of the Translation Space: Automatic Paraphrasing for Machine Translation Evaluation
TLDR
This work assesses the feasibility of improving BLEU by using state-of-the-art neural paraphrasing techniques to generate additional references, exploring both the extent to which diverse paraphrases can adequately cover the space of valid translations and an alternative approach that generates paraphrases constrained by MT outputs.
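As a toy illustration of the reference-augmentation idea, the sketch below scores one hypothesis against a single reference and then with an added paraphrase. The hand-written paraphrase stands in for the neural paraphraser studied above; sacrebleu (2.x API assumed) takes one list per reference stream.

from sacrebleu.metrics import BLEU

hyp = ["The economy grew quickly last year."]
base_refs = [["The economy expanded rapidly last year."]]
para_refs = [["Last year the economy grew quickly."]]   # paraphrase stream

bleu = BLEU()
print(bleu.corpus_score(hyp, base_refs))                # single reference
print(bleu.corpus_score(hyp, base_refs + para_refs))    # reference + paraphrase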
CLIReval: Evaluating Machine Translation as a Cross-Lingual Information Retrieval Task
TLDR
Results suggest that CLIReval is competitive in many language pairs in terms of correlation with human judgments of quality, though it is not intended to replace popular intrinsic metrics such as BLEU.
Who Are We Talking About? Handling Person Names in Speech Translation
Recent work has shown that systems for speech translation (ST), similarly to automatic speech recognition (ASR), handle person names poorly. This shortcoming not only leads to errors that can…
On The Evaluation of Machine Translation Systems Trained With Back-Translation
TLDR
Empirical evidence is provided supporting the view that back-translation is preferred by humans because it produces more fluent outputs, and the authors recommend complementing BLEU with a language-model score to measure fluency.
Reproducibility Issues for BERT-based Evaluation Metrics
TLDR
This paper asks whether results and claims from four recent BERT-based metrics can be reproduced, and finds that reproduction often fails because of heavy undocumented preprocessing in the metrics, missing code, and the reporting of weaker results for the baseline metrics.
Data Processing Matters: SRPH-Konvergen AI’s Machine Translation System for WMT’21
TLDR
Despite using only a standard Transformer, the submission of the joint Samsung Research Philippines-Konvergen AI team for the WMT’21 Large Scale Multilingual Translation Task (Small Track 2) ranked first in Indonesian-to-Javanese translation, showing that data preprocessing matters as much as, if not more than, cutting-edge model architectures and training techniques.

References

Showing 1–10 of 23 references
A Structured Review of the Validity of BLEU
TLDR
The evidence supports using BLEU for diagnostic evaluation of MT systems, but does not support using it outside of MT, for evaluation of individual texts, or for scientific hypothesis testing.
Re-evaluating the Role of Bleu in Machine Translation Research
TLDR
It is shown that an improved Bleu score is neither necessary nor sufficient for achieving an actual improvement in translation quality, and two significant counterexamples to Bleu’s correlation with human judgments of quality are given.
Treebank Annotation Schemes and Parser Evaluation for German
TLDR
The results of the experiments show that, contrary to Kübler et al. (2006), the question of whether German is harder to parse than English remains undecided.
Addressing the Rare Word Problem in Neural Machine Translation
TLDR
This paper proposes and implements an effective technique to address the problem of end-to-end neural machine translation's inability to correctly translate very rare words, and is the first to surpass the best result achieved on a WMT’14 contest task.
Is Machine Translation Getting Better over Time?
TLDR
A large-scale crowd-sourcing experiment is carried out to estimate the degree to which state-of-the-art performance in machine translation has increased over the past five years, with Czech-to-English translation standing out as the language pair achieving the most substantial gains.
Bleu: a Method for Automatic Evaluation of Machine Translation
TLDR
This work proposes a method of automatic machine translation evaluation that is quick, inexpensive, and language-independent, that correlates highly with human evaluation, and that has little marginal cost per run.
Effective Approaches to Attention-based Neural Machine Translation
TLDR
Two approaches are examined: a global approach that always attends to all source words, and a local one that looks at only a subset of source words at a time; both are shown to be effective on the WMT translation tasks between English and German in both directions.
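A hedged NumPy sketch of that contrast, using simple dot-product scoring: the fixed window position p_t stands in for the predicted alignment point of the paper's local variant, and all names and sizes are illustrative.

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def global_attention(h_t, src):
    scores = src @ h_t                      # score every source state
    return softmax(scores) @ src            # context over all positions

def local_attention(h_t, src, p_t, window=2):
    lo, hi = max(0, p_t - window), min(len(src), p_t + window + 1)
    scores = src[lo:hi] @ h_t               # score only a window
    return softmax(scores) @ src[lo:hi]     # context over the window

rng = np.random.default_rng(0)
src = rng.normal(size=(10, 8))              # 10 source states, dim 8
h_t = rng.normal(size=8)                    # current decoder state
print(global_attention(h_t, src).shape, local_attention(h_t, src, p_t=4).shape)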
Neural Machine Translation of Rare Words with Subword Units
TLDR
This paper introduces a simpler and more effective approach, making the NMT model capable of open-vocabulary translation by encoding rare and unknown words as sequences of subword units, and empirically shows that subword models improve over a back-off dictionary baseline for the WMT 15 translation tasks English-German and English-Russian by 1.3 BLEU.
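The merge loop at the heart of byte-pair encoding fits in a few lines. The sketch below is an illustrative reimplementation on a toy vocabulary of word-frequency counts, not the paper's released code; it omits details such as end-of-word markers and symbol-boundary checks.

from collections import Counter

def learn_bpe(word_counts, num_merges):
    # Represent each word as space-separated symbols, initially characters.
    vocab = {" ".join(word): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, count in vocab.items():
            symbols = word.split()
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]   # most frequent adjacent pair
        merges.append((a, b))
        vocab = {w.replace(f"{a} {b}", f"{a}{b}"): c for w, c in vocab.items()}
    return merges

toy = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
print(learn_bpe(toy, 4))   # e.g. [('e', 's'), ('es', 't'), ...]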
A Procedure for Quantitatively Comparing the Syntactic Coverage of English Grammars
The problem of quantitatively comparing the performance of different broad-coverage grammars of English has to date resisted solution. Prima facie, known English grammars appear to disagree strongly…
Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation
TLDR
GNMT, Google's Neural Machine Translation system, is presented; it attempts to address many of the weaknesses of conventional phrase-based translation systems and provides a good balance between the flexibility of "character"-delimited models and the efficiency of "word"-delimited models.