• Corpus ID: 246035768

Syntax-based data augmentation for Hungarian-English machine translation

Attila Nagy, Patrick Nanys, Balázs Frey Konrád, Bence Bial, Judit Ács
We train Transformer-based neural machine translation models for Hungarian-English and English-Hungarian using the Hunglish2 corpus. Our best models achieve a BLEU score of 40.0 on Hungarian-English and 33.4 on English-Hungarian. Furthermore, we present results from ongoing work on syntax-based augmentation for neural machine translation. Both our code and models are publicly available.


Syntax-aware Data Augmentation for Neural Machine Translation
Results of extensive experiments show that the proposed syntax-aware data augmentation method may effectively boost existing sentence-independent methods, yielding significant improvements in translation performance.
The University of Edinburgh’s English-German and English-Hausa Submissions to the WMT21 News Translation Task
This paper presents the University of Edinburgh’s constrained submissions of English-German and English-Hausa systems to the WMT 2021 shared task on news translation. We build En-De systems in three
OpenNMT: Open-Source Toolkit for Neural Machine Translation
The toolkit prioritizes efficiency, modularity, and extensibility with the goal of supporting NMT research into model architectures, feature representations, and source modalities, while maintaining competitive performance and reasonable training requirements.
Data Augmentation via Dependency Tree Morphing for Low-Resource Languages
It is shown that crop and rotate provide improvements over models trained with non-augmented data for the majority of the languages, especially for languages with rich case marking systems.
English to Hungarian Morpheme-based Statistical Machine Translation System with Reordering Rules
A method is presented that tries to overcome problems in the case of English-Hungarian translation by applying reordering rules prior to the translation process and by creating morpheme-based and factored models.
Data Augmentation via Subtree Swapping for Dependency Parsing of Low-Resource Languages
A new data augmentation method for artificially creating new dependency-annotated sentences by swapping subtrees between annotated sentences while enforcing strong constraints on those trees to ensure maximal grammaticality of the new sentences is presented.
Improving Neural Machine Translation Models with Monolingual Data
This work pairs monolingual training data with an automatic back-translation so that it can be treated as additional parallel training data, and obtains substantial improvements on the WMT 15 English<->German task and on the low-resource IWSLT 14 Turkish->English task.
SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing
This paper presents SentencePiece, a language-independent subword tokenizer and detokenizer designed for neural text processing, and finds that it is possible to achieve comparable accuracy with direct subword training from raw sentences.
Universal Dependencies v1: A Multilingual Treebank Collection
This paper describes v1 of the universal guidelines, the underlying design principles, and the currently available treebanks for 33 languages, as well as highlighting the needs for sound comparative evaluation and cross-lingual learning experiments.
A systematic comparison of methods for low-resource dependency parsing on genuinely low-resource languages
This work systematically compares a set of simple strategies for improving low-resource parsers: data augmentation, which has not been tested before; cross-lingual training; and transliteration.