Bilingual Multi-Word Term Tokenization for Chinese–Japanese Patent Translation
@inproceedings{Yang2017BilingualMT, title={Bilingual Multi-Word Term Tokenization for Chinese–Japanese Patent Translation}, author={Wei Yang and Y. Lepage}, year={2017} }
We propose to re-tokenize data with aligned bilingual multi-word terms to improve statistical machine translation (SMT) in technical domains. To do so, we extract multi-word terms independently from each monolingual side of the training data. Promising bilingual multi-word terms are then identified with the sampling-based alignment method by setting a threshold on translation probabilities. We estimate that the extracted bilingual multi-word terms are correct in more than 70% of the cases…
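The pipeline described in the abstract (threshold candidate term pairs on translation probabilities, then re-tokenize the text so each retained multi-word term becomes a single token) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and does not come from the paper: the function names `filter_term_pairs` and `retokenize`, the table layout, the choice to threshold both translation directions, the underscore joiner, the greedy longest-match strategy, and the romanized toy data.

```python
from typing import Dict, List, Tuple

# Hypothetical layout: (source term, target term) -> (p(target|source), p(source|target)).
TermTable = Dict[Tuple[str, str], Tuple[float, float]]


def filter_term_pairs(table: TermTable, threshold: float) -> List[Tuple[str, str]]:
    """Keep the pairs whose two translation probabilities both reach the threshold.

    Thresholding both directions is an assumption of this sketch.
    """
    return [pair for pair, (p_fwd, p_bwd) in table.items()
            if p_fwd >= threshold and p_bwd >= threshold]


def retokenize(tokens: List[str], terms: List[str], joiner: str = "_") -> List[str]:
    """Merge each occurrence of a multi-word term into one token.

    At every position the longest matching term wins (greedy matching).
    """
    # Longest terms first so that nested shorter terms do not pre-empt them.
    term_tuples = sorted({tuple(t.split()) for t in terms}, key=len, reverse=True)
    out: List[str] = []
    i = 0
    while i < len(tokens):
        for term in term_tuples:
            n = len(term)
            if n > 1 and tuple(tokens[i:i + n]) == term:
                out.append(joiner.join(term))
                i += n
                break
        else:
            # No multi-word term starts here: copy the token unchanged.
            out.append(tokens[i])
            i += 1
    return out


# Toy data (invented for illustration only).
toy_table: TermTable = {
    ("machine translation system", "kikai honyaku shisutemu"): (0.8, 0.7),
    ("language model", "gengo moderu"): (0.3, 0.9),
}
kept = filter_term_pairs(toy_table, threshold=0.5)
source_terms = [src for src, _ in kept]
sentence = "the machine translation system uses a language model".split()
print(retokenize(sentence, source_terms))
```

In this toy run only the first pair passes the threshold, so "machine translation system" is merged into one token while "language model" stays as two. The same merging would be applied to the target side with the target halves of the retained pairs before training the SMT system.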
References
Sampling-based Multilingual Alignment
- Computer Science, RANLP
- 2009
A sub-sentential alignment method that extracts high quality multi-word alignments from sentence-aligned multilingual parallel corpora that is competitive with state-of-the-art methods.
An Application and Evaluation of the C/NC-value Approach for the Automatic Term Recognition of Multi-Word Units in Japanese
- Computer Science
- 2000
Several experiments analysing the performance of the C/NC-value method using the NACSIS Japanese AI-domain corpus demonstrate that the method can be utilized to realize a practical domain- and language-independent term recognition system.
Minimum Error Rate Training in Statistical Machine Translation
- Computer Science, ACL
- 2003
It is shown that significantly better results can often be obtained if the final evaluation criterion is taken directly into account as part of the training procedure.
Moses: Open Source Toolkit for Statistical Machine Translation
- Computer Science, ACL
- 2007
We describe an open-source toolkit for statistical machine translation whose novel contributions are (a) support for linguistically motivated factors, (b) confusion network decoding, and (c)…
Bleu: a Method for Automatic Evaluation of Machine Translation
- Computer Science, ACL
- 2002
This work proposes a method of automatic machine translation evaluation that is quick, inexpensive, and language-independent, that correlates highly with human evaluation, and that has little marginal cost per run.
Automatic Term Extraction and Document Similarity in Special Text Corpora
- Computer Science
- 2003
This paper confirms that the performance of a state-of-the-art automatic term extraction method on a computer science corpus is similar to previously published performance data on a medical corpus.…
Automatic recognition of multi-word terms: the C-value/NC-value method
- Computer Science, International Journal on Digital Libraries
- 2000
This paper presents a domain-independent method for the automatic extraction of multi-word terms from machine-readable special language corpora using C-value/NC-value, which enhances the common statistical measure of frequency of occurrence for term extraction, making it sensitive to a particular type of multi-word terms, the nested terms.
KenLM: Faster and Smaller Language Model Queries
- Computer Science, WMT@EMNLP
- 2011
KenLM is a library that implements two data structures for efficient language model queries, reducing both time and memory costs, and is integrated into the Moses, cdec, and Joshua translation systems.