In this paper, we examine a number of different phrase segmentation approaches for Machine Translation and how they perform when used to supplement the translation model of a phrase-based SMT system. This work represents a summary of a number of years of research carried out at Dublin City University in which it has been found that improvements can be made(More)
This paper describes the data collection and parallel corpus compilation activities carried out in the FP7 EU-funded SUMAT project. This project aims to develop an online subtitle translation service for nine European languages combined into 14 different language pairs. This data provides bilingual and monolingual training data for statistical machine(More)
Bilingual termbanks are important for many natural language processing (NLP) applications, especially in translation workflows in industrial settings. In this paper, we apply a log-likelihood comparison method to extract monolingual terminology from the source and target sides of a parallel corpus. Then, using a Phrase-Based Statistical Machine Translation(More)
We describe OpenMaTrEx, a free/open-source example-based machine translation (EBMT) system based on the marker hypothesis, comprising a marker-driven chunker, a collection of chunk aligners, and two engines: one based on a simple proof-of-concept monotone EBMT recombinator and a Moses-based statistical decoder. OpenMaTrEx is a free/open-source release of(More)
Data sparseness is a well-known problem for statistical machine translation (SMT) when morphologically rich and highly inflected languages are involved. This problem becomes worse in resource-scarce scenarios where sufficient parallel corpora are not available for model training. Recent research has shown that morphological segmentation can be employed on(More)
In this work we present a novel technique to rescore fragments in the Data-Oriented Translation model based on their contribution to translation accuracy. We describe three new rescoring methods, and present the initial results of a pilot experiment on a small subset of the Europarl corpus. This work is a proof-of-concept, and is the first step in directly(More)