• Corpus ID: 5931915

Bilingual Multi-Word Term Tokenization for Chinese – Japanese Patent Translation

Wei Yang and Y. Lepage
We propose to re-tokenize data with aligned bilingual multi-word terms to improve statistical machine translation (SMT) in technical domains. To do so, we independently extract multi-word terms from the monolingual parts of the training data. Promising bilingual multi-word terms are then identified using the sampling-based alignment method by setting a threshold on translation probabilities. We estimate that the bilingual multi-word terms extracted are correct in more than 70% of the cases…
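The pipeline described in the abstract (filter candidate term pairs by translation probability, then re-tokenize the corpus so that each retained term becomes a single token) might be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `filter_term_pairs` and `retokenize`, the requirement that both translation directions reach the threshold, the `_` joiner, and the probabilities in the example are all assumptions made for the sketch.

```python
from typing import Dict, List, Set, Tuple

Term = Tuple[str, ...]  # a multi-word term as a tuple of tokens


def filter_term_pairs(
    candidates: Dict[Tuple[Term, Term], Tuple[float, float]],
    threshold: float,
) -> List[Tuple[Term, Term]]:
    """Keep candidate pairs whose two translation probabilities both
    reach the threshold (requiring both directions is an assumption)."""
    return [
        pair
        for pair, (p_fwd, p_bwd) in candidates.items()
        if p_fwd >= threshold and p_bwd >= threshold
    ]


def retokenize(tokens: List[str], terms: Set[Term], joiner: str = "_") -> List[str]:
    """Merge every occurrence of a known multi-word term into one token,
    preferring the longest match at each position."""
    max_len = max((len(t) for t in terms), default=1)
    out: List[str] = []
    i = 0
    while i < len(tokens):
        # Try the longest possible span first, down to two tokens.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            if tuple(tokens[i:i + n]) in terms:
                out.append(joiner.join(tokens[i:i + n]))
                i += n
                break
        else:
            # No term starts here: keep the single token unchanged.
            out.append(tokens[i])
            i += 1
    return out


# Illustrative data only; the probabilities are invented.
candidates = {
    (("液晶", "显示"), ("液晶", "表示")): (0.9, 0.8),
    (("显示", "装置"), ("表示", "方法")): (0.2, 0.1),
}
kept = filter_term_pairs(candidates, threshold=0.5)
zh_terms = {zh for zh, _ in kept}
print(retokenize(["该", "液晶", "显示", "装置"], zh_terms))
# prints ['该', '液晶_显示', '装置']
```

The same `retokenize` call would be applied to the Japanese side with the Japanese halves of the retained pairs, so that both sides of the training data treat an aligned term as one unit.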

Sampling-based Multilingual Alignment

A sub-sentential alignment method that extracts high-quality multi-word alignments from sentence-aligned multilingual parallel corpora and is competitive with state-of-the-art methods.

An Application and Evaluation of the C/NC-value Approach for the Automatic Term Recognition of Multi-Word Units in Japanese

Several experiments analysing the performance of the C/NC-value method using the NACSIS Japanese AI-domain corpus demonstrate that the method can be utilized to realize a practical domain- and language-independent term recognition system.

Minimum Error Rate Training in Statistical Machine Translation

It is shown that significantly better results can often be obtained if the final evaluation criterion is taken directly into account as part of the training procedure.

Moses: Open Source Toolkit for Statistical Machine Translation

We describe an open-source toolkit for statistical machine translation whose novel contributions are (a) support for linguistically motivated factors, (b) confusion network decoding, and (c)

Bleu: a Method for Automatic Evaluation of Machine Translation

This work proposes a method of automatic machine translation evaluation that is quick, inexpensive, and language-independent, that correlates highly with human evaluation, and that has little marginal cost per run.


This paper confirms that the performance of a state-of-the-art automatic term extraction method on a computer science corpus is similar to previously published performance data on a medical corpus.

Automatic recognition of multi-word terms: the C-value/NC-value method

This paper presents a domain-independent method for the automatic extraction of multi-word terms from machine-readable special language corpora using C-value/NC-value, which enhances the common statistical measure of frequency of occurrence for term extraction, making it sensitive to a particular type of multi-word term, the nested term.
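To make the treatment of nested terms concrete, the C-value score can be sketched as below. This is a simplified reading of the published formula, under stated assumptions: a term that is not nested scores log2(length) × frequency, while a nested term has the average frequency of the longer candidates containing it subtracted first. The helper names `is_nested` and `c_value` and the example frequencies are hypothetical, and the NC-value context weighting is not covered.

```python
import math
from typing import Dict, Tuple

Term = Tuple[str, ...]


def is_nested(short: Term, long: Term) -> bool:
    """True if `short` occurs as a contiguous sub-sequence of the longer `long`."""
    n = len(short)
    return n < len(long) and any(
        long[i:i + n] == short for i in range(len(long) - n + 1)
    )


def c_value(term: Term, freq: Dict[Term, int]) -> float:
    """C-value of a multi-word candidate, given candidate frequencies."""
    containing = [b for b in freq if is_nested(term, b)]
    weight = math.log2(len(term))
    if not containing:
        # Not nested in any longer candidate: length-weighted frequency.
        return weight * freq[term]
    # Nested: discount by the mean frequency of the containing candidates.
    mean_longer = sum(freq[b] for b in containing) / len(containing)
    return weight * (freq[term] - mean_longer)


# Invented frequencies, for illustration only.
freq = {("soft", "contact", "lens"): 5, ("contact", "lens"): 8}
print(c_value(("contact", "lens"), freq))  # prints 3.0
```

Here "contact lens" occurs 8 times, but 5 of those are inside "soft contact lens", so its score reflects only the 3 remaining occurrences.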

KenLM: Faster and Smaller Language Model Queries

KenLM is a library that implements two data structures for efficient language model queries, reducing both time and memory costs. It is integrated into the Moses, cdec, and Joshua translation systems.