On the Limitations of Cross-lingual Encoders as Exposed by Reference-Free Machine Translation Evaluation

@inproceedings{Zhao2020OnTL,
  title={On the Limitations of Cross-lingual Encoders as Exposed by Reference-Free Machine Translation Evaluation},
  author={Wei Zhao and Goran Glavaš and Maxime Peyrard and Yang Gao and Robert West and Steffen Eger},
  booktitle={ACL},
  year={2020}
}
Evaluation of cross-lingual encoders is usually performed either via zero-shot cross-lingual transfer in supervised downstream tasks or via unsupervised cross-lingual textual similarity. In this paper, we concern ourselves with reference-free machine translation (MT) evaluation where we directly compare source texts to (sometimes low-quality) system translations, which represents a natural adversarial setup for multilingual encoders. Reference-free evaluation holds the promise of web-scale… 
Machine Translation Reference-less Evaluation using YiSi-2 with Bilingual Mappings of Massive Multilingual Language Model
TLDR
YiSi-2’s correlation with human direct assessment on translation quality is greatly improved by replacing multilingual BERT with XLM-RoBERTa and projecting the source embeddings into the target embedding space using a cross-lingual linear projection (CLP) matrix learnt from a small development set.
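The cross-lingual linear projection described above can be sketched as a least-squares fit of a matrix mapping source embeddings onto paired target embeddings. This is a minimal toy illustration, not the paper's implementation: the dimensions, synthetic data, and closed-form least-squares solver are all assumptions.

```python
import numpy as np

# Toy sketch of a cross-lingual linear projection (CLP): learn a matrix W
# mapping source-side embeddings into the target embedding space from a
# small development set of paired (source, target) embeddings.
rng = np.random.default_rng(0)
d = 8            # embedding dimension (toy value)
n_pairs = 100    # size of the small development set

X = rng.normal(size=(n_pairs, d))                       # source embeddings
W_true = rng.normal(size=(d, d))                        # unknown mapping (toy ground truth)
Y = X @ W_true + 0.01 * rng.normal(size=(n_pairs, d))   # noisy target embeddings

# Closed-form least-squares fit: W = argmin ||X W - Y||_F^2
W, *_ = np.linalg.lstsq(X, Y, rcond=None)

# New source embeddings are projected into the target space via X @ W
projected = X @ W
print(np.linalg.norm(projected - Y))
```

With enough pairs relative to the dimension, the recovered matrix closely matches the true mapping; variants in the literature additionally constrain W to be orthogonal (a Procrustes solution).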
COMET: A Neural Framework for MT Evaluation
TLDR
This framework leverages recent breakthroughs in cross-lingual pretrained language modeling resulting in highly multilingual and adaptable MT evaluation models that exploit information from both the source input and a target-language reference translation in order to more accurately predict MT quality.
KoBE: Knowledge-Based Machine Translation Evaluation
TLDR
This work proposes a simple and effective method for machine translation evaluation which does not require reference translations, and achieves the highest correlation with human judgements on 9 out of the 18 language pairs from the WMT19 benchmark for evaluation without references.
Sometimes We Want Translationese
TLDR
This paper proposes a simple, novel way to quantify whether an NMT system exhibits robustness and faithfulness, focusing on the case of word-order perturbations, and explores a suite of functions to perturb the word order of source sentences without deleting or injecting tokens.
Improving Parallel Data Identification using Iteratively Refined Sentence Alignments and Bilingual Mappings of Pre-trained Language Models
The National Research Council of Canada’s team submissions to the parallel corpus filtering task at the Fifth Conference on Machine Translation are based on two key components: (1) iteratively
Is Supervised Syntactic Parsing Beneficial for Language Understanding Tasks? An Empirical Investigation
TLDR
This work empirically investigates the usefulness of supervised parsing for semantic LU in the context of LM-pretrained transformer networks, and results show that explicit formalized syntax, injected into transformers through IPT, has very limited and inconsistent effect on downstream LU performance.
Identifying Elements Essential for BERT’s Multilinguality
TLDR
Overall, four architectural and two linguistic elements that influence multilinguality in BERT are identified and an efficient setup with small BERT models trained on a mix of synthetic and natural data is proposed.
Probing Multilingual BERT for Genetic and Typological Signals
TLDR
The layers in multilingual BERT (mBERT) are probed for phylogenetic and geographic language signals across 100 languages and language distances based on the mBERT representations are computed, finding that they are close to the reference family tree in terms of quartet tree distance.
Reference-Free Word- and Sentence-Level Translation Evaluation with Token-Matching Metrics
Many modern machine translation evaluation metrics like BERTScore, BLEURT, COMET, MonoTransquest or XMoverScore are based on black-box language models. Hence, it is difficult to explain why these
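A token-matching metric of the kind this entry contrasts with black-box models can be sketched as greedy cosine matching between token embeddings, in the spirit of BERTScore-style precision. This is an illustrative assumption, not the paper's metric: the embeddings are random toy vectors and the greedy best-match average is a simplification.

```python
import numpy as np

def cosine_matrix(A, B):
    """Pairwise cosine similarity between rows of A and rows of B."""
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    return A @ B.T

def token_match_score(src_emb, hyp_emb):
    """Greedy token matching: each hypothesis token is matched to its most
    similar source token; the score is the mean of the best matches."""
    sims = cosine_matrix(hyp_emb, src_emb)
    return float(sims.max(axis=1).mean())

rng = np.random.default_rng(1)
src = rng.normal(size=(5, 16))   # toy source-token embeddings
print(token_match_score(src, src))
```

Because every match is an explicit token pair, such metrics expose *which* tokens drove the score, which is the interpretability argument sketched in the abstract above.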

References

SHOWING 1-10 OF 63 REFERENCES
Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond
TLDR
An architecture to learn joint multilingual sentence representations for 93 languages, belonging to more than 30 different families and written in 28 different scripts using a single BiLSTM encoder with a shared byte-pair encoding vocabulary for all languages, coupled with an auxiliary decoder and trained on publicly available parallel corpora.
XMEANT: Better semantic MT evaluation without reference translations
We introduce XMEANT—a new cross-lingual version of the semantic frame based MT evaluation metric MEANT—which can correlate even more closely with human adequacy judgments than monolingual MEANT and
On The Evaluation of Machine Translation Systems Trained With Back-Translation
TLDR
Empirical evidence is provided to support the view that back-translation is preferred by humans because it produces more fluent outputs and to recommend complementing BLEU with a language model score to measure fluency.
On the Limitations of Unsupervised Bilingual Dictionary Induction
TLDR
It is shown that a simple trick, exploiting a weak supervision signal from identical words, enables more robust induction and establishes a near-perfect correlation between unsupervised bilingual dictionary induction performance and a previously unexplored graph similarity metric.
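The "weak supervision signal from identical words" mentioned above amounts to seeding a bilingual dictionary with word forms that appear verbatim in both languages' vocabularies (names, numbers, loanwords). A minimal sketch, with illustrative toy vocabularies:

```python
# Seed a bilingual dictionary from identical strings across vocabularies.
# The vocabularies below are toy examples, not data from the paper.
src_vocab = {"berlin", "computer", "haus", "katze", "2018"}
tgt_vocab = {"berlin", "computer", "house", "cat", "2018"}

# Strings shared verbatim by both vocabularies serve as translation pairs.
seed_dictionary = sorted(src_vocab & tgt_vocab)
print(seed_dictionary)
```

These seed pairs then anchor the induction procedure, making it far more robust than fully unsupervised initialization.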
How to (Properly) Evaluate Cross-Lingual Word Embeddings: On Strong Baselines, Comparative Analyses, and Some Misconceptions
TLDR
It is empirically demonstrated that the performance of CLE models largely depends on the task at hand and that optimizing CLE models for BLI may hurt downstream performance; the most robust supervised and unsupervised CLE models are identified.
Do We Really Need Fully Unsupervised Cross-Lingual Embeddings?
TLDR
It is shown that fully unsupervised CLWE methods still fail for a large number of language pairs and never surpass the performance of weakly supervised methods using the same self-learning procedure in any BLI setup, and the gaps are often substantial.
Cross-Lingual Alignment of Contextual Word Embeddings, with Applications to Zero-shot Dependency Parsing
TLDR
A novel method for multilingual transfer that utilizes deep contextual embeddings, pretrained in an unsupervised fashion, that consistently outperforms the previous state-of-the-art on 6 tested languages, yielding an improvement of 6.8 LAS points on average.
Google’s Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation
TLDR
This work proposes a simple solution to use a single Neural Machine Translation (NMT) model to translate between multiple languages using a shared wordpiece vocabulary, and introduces an artificial token at the beginning of the input sentence to specify the required target language.
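The artificial-token mechanism described above is simple enough to sketch directly: prepend a token naming the desired target language to the source sentence before feeding it to the single shared model. The `<2xx>` token format follows the common convention; the exact tokens used by the system are an assumption here.

```python
def add_target_token(sentence: str, target_lang: str) -> str:
    """Prepend an artificial target-language token (e.g. "<2es>") so a
    single multilingual NMT model knows which language to translate into."""
    return f"<2{target_lang}> {sentence}"

print(add_target_token("Hello world", "es"))  # "<2es> Hello world"
```

Because the token is just another vocabulary item, the same trick enables zero-shot translation between language pairs never seen together in training.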
Unsupervised Cross-lingual Representation Learning at Scale
TLDR
It is shown that pretraining multilingual language models at scale leads to significant performance gains for a wide range of cross-lingual transfer tasks, and the possibility of multilingual modeling without sacrificing per-language performance is shown for the first time.
Multilingual Universal Sentence Encoder for Semantic Retrieval
TLDR
On transfer learning tasks, the multilingual embeddings approach, and in some cases exceed, the performance of English-only sentence embeddings.