• Corpus ID: 229365669

YerevaNN’s Systems for WMT20 Biomedical Translation Task: The Effect of Fixing Misaligned Sentence Pairs

Authors: Karen Hambardzumyan, Hovhannes Tamoyan, H. Khachatrian

This report describes YerevaNN’s neural machine translation systems and data processing pipelines developed for the WMT20 biomedical translation task. We provide systems for the English-Russian and English-German language pairs. For the English-Russian pair, our submissions achieve the best BLEU scores, with the en→ru direction outperforming the other systems by a significant margin. We attribute most of the improvements to our heavy data preprocessing pipeline, which attempts to fix poorly aligned…

Findings of the WMT 2020 Biomedical Translation Shared Task: Basque, Italian and Russian as New Additional Languages

In the fifth edition of the WMT Biomedical Task, the task addressed the evaluation of both scientific abstracts and terminologies and received submissions from a total of 20 teams.



Huawei’s NMT Systems for the WMT 2019 Biomedical Translation Task

Huawei’s neural machine translation systems for the WMT 2019 biomedical translation shared task are described, which achieve the best BLEU scores on English–French and English–German language pairs according to the official evaluation results.

UCAM Biomedical Translation at WMT19: Transfer Learning Multi-domain Ensembles

This work approached the 2019 WMT Biomedical translation task using transfer learning to obtain a series of strong neural models on distinct domains, and combining them into multi-domain ensembles using an adaptive language-model ensemble weighting scheme.

ScispaCy: Fast and Robust Models for Biomedical Natural Language Processing

This paper describes scispaCy, a new Python library and set of models for practical biomedical/scientific text processing that heavily leverages the spaCy library. It details the performance of two packages of models released in scispaCy and demonstrates their robustness on several tasks and datasets.

Facebook FAIR’s WMT19 News Translation Task Submission

This paper describes Facebook FAIR’s submission to the WMT19 shared news translation task, which achieves the best case-sensitive BLEU score for the English→Russian translation direction.

Understanding Back-Translation at Scale

This work broadens the understanding of back-translation and investigates a number of methods to generate synthetic source sentences, finding that in all but resource-poor settings, back-translations obtained via sampling or noised beam outputs are most effective.

Unsupervised Multilingual Sentence Embeddings for Parallel Corpus Mining

A novel unsupervised method to derive multilingual sentence embeddings relying only on monolingual data is proposed and it is observed that a single synthetic bilingual corpus is able to improve results for other language pairs.

Moses: Open Source Toolkit for Statistical Machine Translation

We describe an open-source toolkit for statistical machine translation whose novel contributions are (a) support for linguistically motivated factors, (b) confusion network decoding, and (c)

Margin-based Parallel Corpus Mining with Multilingual Sentence Embeddings

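This paper proposes a new method for parallel corpus mining based on multilingual sentence embeddings. Whereas most existing approaches rely on nearest neighbor retrieval with a hard threshold over cosine similarity, the proposed method accounts for the scale inconsistencies of this measure by considering the margin between a given sentence pair and its closest candidates.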
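
The margin idea named in the title above can be illustrated with a minimal sketch. It assumes the ratio form of the margin, in which the cosine similarity of a candidate pair is divided by the average similarity of each sentence to its k nearest neighbors in the other language. The function names, the default `k`, and the plain-list embeddings are hypothetical choices for illustration, not the paper's code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_top_k(query, candidates, k):
    """Average cosine similarity of `query` to its k most similar candidates."""
    sims = sorted((cosine(query, c) for c in candidates), reverse=True)
    top = sims[:k]
    return sum(top) / len(top)


def ratio_margin_score(x, y, src_vecs, tgt_vecs, k=4):
    """Ratio-margin score for a source embedding x and a target embedding y.

    Neighbors of x are searched among the target embeddings and neighbors
    of y among the source embeddings (an assumption of this sketch).
    """
    denom = (mean_top_k(x, tgt_vecs, k) + mean_top_k(y, src_vecs, k)) / 2
    return cosine(x, y) / denom
```

A pair that is only moderately similar in absolute terms can still score well if both sentences are far from all other candidates, which is what a fixed cosine threshold cannot capture. In practice the neighbor search would use an approximate nearest-neighbor index rather than the exhaustive sort used here.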

fairseq: A Fast, Extensible Toolkit for Sequence Modeling

Fairseq is an open-source sequence modeling toolkit that allows researchers and developers to train custom models for translation, summarization, language modeling, and other text generation tasks and supports distributed training across multiple GPUs and machines.

Attention is All you Need

A new simple network architecture, the Transformer, is proposed. It is based solely on attention mechanisms, dispensing with recurrence and convolutions entirely, and it generalizes well to other tasks, as shown by applying it successfully to English constituency parsing with both large and limited training data.