Corpus ID: 245650254

How do lexical semantics affect translation? An empirical study

@article{Subramanian2022HowDL,
  title={How do lexical semantics affect translation? An empirical study},
  author={Vivek Subramanian and Dhanasekar Sundararaman},
  journal={ArXiv},
  year={2022},
  volume={abs/2201.00075}
}
Neural machine translation (NMT) systems aim to map text from one language into another. While there is a wide variety of applications of NMT, one of the most important is translation of natural language. A distinguishing factor of natural language is that words are typically ordered according to the rules of the grammar of a given language. Although many advances have been made in developing NMT systems for translating natural language, little research has been done on understanding how the…

Citations

Improving Downstream Task Performance by Treating Numbers as Entities

TLDR
This work proposes a classification of numbers into entities that helps NLP models perform well on several tasks, including a handcrafted Fill-In-The-Blank (FITB) task and question answering using joint embeddings, outperforming the BERT and RoBERTa baseline classification.
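
As a rough illustration of the idea, the sketch below replaces numeric tokens with coarse entity tags before the text reaches a model. The tag set (YEAR, PERCENT, CARDINAL) and the regular expressions are hypothetical stand-ins for illustration, not the paper's actual taxonomy or implementation.

```python
# Hypothetical sketch: map numeric tokens to coarse entity tags.
# The tag set and regexes are illustrative, not the paper's taxonomy.
import re

def tag_numbers(tokens):
    tagged = []
    for tok in tokens:
        if re.fullmatch(r"(19|20)\d{2}", tok):
            tagged.append("[YEAR]")
        elif re.fullmatch(r"\d+(\.\d+)?%", tok):
            tagged.append("[PERCENT]")
        elif re.fullmatch(r"\d+(\.\d+)?", tok):
            tagged.append("[CARDINAL]")
        else:
            tagged.append(tok)
    return tagged

print(tag_numbers("GDP grew 3.5% in 2021 , adding 12000 jobs".split()))
# ['GDP', 'grew', '[PERCENT]', 'in', '[YEAR]', ',', 'adding', '[CARDINAL]', 'jobs']
```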

Debiasing Gender Bias in Information Retrieval Models

TLDR
It is shown that pre-trained models for IR do not perform well in zero-shot retrieval tasks when full-tuning of a large pre-trained BERT encoder is performed, and that lightweight tuning with adapter networks improves zero-shot retrieval performance by almost 20% over the baseline.
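
The sketch below shows what a lightweight bottleneck adapter of the kind described might look like in PyTorch. The residual down-project/up-project design follows common adapter architectures, and the layer sizes are assumptions, not the paper's exact configuration.

```python
# Minimal sketch of a bottleneck adapter attached to a frozen encoder.
# Sizes are illustrative; only the adapter's parameters would be trained.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, hidden_size=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)
        self.up = nn.Linear(bottleneck, hidden_size)

    def forward(self, x):
        # Residual connection keeps the frozen representation intact.
        return x + self.up(torch.relu(self.down(x)))

adapter = Adapter()
hidden = torch.randn(2, 10, 768)   # (batch, seq_len, hidden)
print(adapter(hidden).shape)       # torch.Size([2, 10, 768])
```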

References

Showing 1-10 of 33 references

Google’s Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation

TLDR
This work proposes a simple solution to use a single Neural Machine Translation (NMT) model to translate between multiple languages using a shared wordpiece vocabulary, and introduces an artificial token at the beginning of the input sentence to specify the required target language.
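
The token trick is simple enough to show directly. In the sketch below, the "<2es>" spelling is in the style of the paper's examples, though the exact vocabulary entries are model-specific.

```python
# Minimal sketch of the target-language token trick: prepend an artificial
# token naming the desired output language, then translate as usual.
def add_language_token(source_sentence: str, target_lang: str) -> str:
    return f"<2{target_lang}> {source_sentence}"

print(add_language_token("How are you?", "es"))
# <2es> How are you?
```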

Linguistic Input Features Improve Neural Machine Translation

TLDR
This paper generalizes the embedding layer of the encoder in the attentional encoder-decoder architecture to support the inclusion of arbitrary features, in addition to the baseline word feature, and finds that linguistic input features improve model quality according to three metrics: perplexity, BLEU, and CHRF3.
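
A minimal sketch of such a feature-augmented embedding layer is given below, assuming PyTorch and using only a POS-tag feature; the vocabulary and dimension sizes are illustrative, and the paper considers further features such as lemmas, subword tags, and dependency labels.

```python
# Sketch: embeddings for auxiliary features (here, POS tags) are
# concatenated with the word embedding. Sizes are illustrative.
import torch
import torch.nn as nn

class FeatureEmbedding(nn.Module):
    def __init__(self, vocab=10000, n_pos=20, word_dim=500, pos_dim=12):
        super().__init__()
        self.word_emb = nn.Embedding(vocab, word_dim)
        self.pos_emb = nn.Embedding(n_pos, pos_dim)

    def forward(self, word_ids, pos_ids):
        # Concatenate along the feature axis -> (batch, seq, word_dim + pos_dim)
        return torch.cat([self.word_emb(word_ids), self.pos_emb(pos_ids)], dim=-1)

emb = FeatureEmbedding()
words = torch.randint(0, 10000, (2, 7))
pos = torch.randint(0, 20, (2, 7))
print(emb(words, pos).shape)  # torch.Size([2, 7, 512])
```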

Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation

TLDR
GNMT, Google's Neural Machine Translation system, is presented, which attempts to address many of the weaknesses of conventional phrase-based translation systems and provides a good balance between the flexibility of "character"-delimited models and the efficiency of "word"-delimited models.
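
The wordpiece idea behind that balance can be sketched as greedy longest-match segmentation over a subword vocabulary, as below. The tiny vocabulary and the BERT-style "##" continuation marker are illustrative conventions, not GNMT's actual vocabulary or markup.

```python
# Sketch of greedy longest-match wordpiece segmentation.
# Vocabulary and "##" continuation marker are illustrative.
def wordpiece(word, vocab):
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start and (word[start:end] if start == 0 else "##" + word[start:end]) not in vocab:
            end -= 1
        if end == start:           # no piece matched: fall back to unknown
            return ["[UNK]"]
        pieces.append(word[start:end] if start == 0 else "##" + word[start:end])
        start = end
    return pieces

vocab = {"trans", "##lat", "##ion", "un", "##seen"}
print(wordpiece("translation", vocab))  # ['trans', '##lat', '##ion']
```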

Neural Machine Translation by Jointly Learning to Align and Translate

TLDR
It is conjectured that the use of a fixed-length vector is a bottleneck in improving the performance of this basic encoder-decoder architecture, and it is proposed to extend this by allowing a model to automatically (soft-)search for parts of a source sentence that are relevant to predicting a target word, without having to form these parts as a hard segment explicitly.
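
The proposed soft-search is the additive attention mechanism; a minimal PyTorch sketch with illustrative dimensions follows. Each decoder state is scored against every encoder state, and a softmax over the scores yields alignment weights used to form a context vector.

```python
# Sketch of Bahdanau-style additive attention. Dimensions are illustrative.
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    def __init__(self, enc_dim=512, dec_dim=512, attn_dim=256):
        super().__init__()
        self.W = nn.Linear(dec_dim, attn_dim, bias=False)
        self.U = nn.Linear(enc_dim, attn_dim, bias=False)
        self.v = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, dec_state, enc_states):
        # dec_state: (batch, dec_dim); enc_states: (batch, src_len, enc_dim)
        scores = self.v(torch.tanh(self.W(dec_state).unsqueeze(1) + self.U(enc_states)))
        weights = torch.softmax(scores.squeeze(-1), dim=-1)        # (batch, src_len)
        context = (weights.unsqueeze(-1) * enc_states).sum(dim=1)  # (batch, enc_dim)
        return context, weights

attn = AdditiveAttention()
context, weights = attn(torch.randn(2, 512), torch.randn(2, 9, 512))
print(context.shape, weights.shape)  # torch.Size([2, 512]) torch.Size([2, 9])
```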

Achieving Human Parity on Automatic Chinese to English News Translation

TLDR
It is found that Microsoft's latest neural machine translation system has reached a new state-of-the-art, and that the translation quality is at human parity when compared to professional human translations.

The JRC-Acquis: A Multilingual Aligned Parallel Corpus with 20+ Languages

TLDR
This paper presents a new, unique, and freely available parallel corpus containing European Union documents of a mostly legal nature, available in all 20 official EU languages, which is particularly suitable for carrying out all types of cross-language research and for testing and benchmarking text analysis software across different languages.

Syntax-Infused Transformer and BERT models for Machine Translation and Natural Language Understanding

TLDR
This work shows that the syntax-infused Transformer with multiple features achieves an improvement of 0.7 BLEU when trained on the full WMT '14 English-to-German translation dataset, and finds that incorporating syntax into BERT fine-tuning outperforms the baseline on a number of downstream tasks from the GLUE benchmark.

Massively Multilingual Neural Machine Translation

TLDR
It is shown that massively multilingual many-to-many models are effective in low-resource settings, outperforming the previous state-of-the-art while supporting up to 59 languages in 116 translation directions in a single model.

Europarl: A Parallel Corpus for Statistical Machine Translation

TLDR
This paper presents a corpus of parallel text in 11 languages collected from the proceedings of the European Parliament, focusing on its acquisition and its application as training data for statistical machine translation (SMT).

Sequence to Sequence Learning with Neural Networks

TLDR
This paper presents a general end-to-end approach to sequence learning that makes minimal assumptions about sequence structure, and finds that reversing the order of the words in all source sentences markedly improved the LSTM's performance, because doing so introduced many short-term dependencies between the source and the target sentence that made the optimization problem easier.
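
The source-reversal trick is easy to state in code; a minimal sketch, assuming whitespace-tokenized sentence pairs, is below. Only the source side is reversed; the target order is left alone.

```python
# Sketch of the source-reversal preprocessing from Sutskever et al.:
# reverse the source tokens, leave the target tokens unchanged.
def reverse_source(pairs):
    return [(list(reversed(src)), tgt) for src, tgt in pairs]

pairs = [(["ich", "bin", "müde"], ["i", "am", "tired"])]
print(reverse_source(pairs))
# [(['müde', 'bin', 'ich'], ['i', 'am', 'tired'])]
```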