We explore efficient domain adaptation for the task of statistical machine translation based on extracting sentences from a large general-domain parallel corpus that are most relevant to the target domain. These sentences may be selected with simple cross-entropy based methods, of which we present three. As these sentences are not themselves identical to…
Our participation in the IWSLT 2005 speech translation task is our first effort to work on limited domain speech data. We adapted our statistical machine translation system that performed successfully in previous DARPA competitions on open domain text translations. We participated in the supplied corpora transcription track. We achieved the highest BLEU…
Objective: Accurate, understandable public health information is important for ensuring the health of the nation. The large portion of the US population with Limited English Proficiency is best served by translations of public-health information into other languages. However, a large number of health departments and primary care clinics face significant…
We define a data model for storing geographic information from multiple sources that enables the efficient production of customizable gazetteers. The GazDB separates names from features while storing the relationships between them. Geographic names are stored in a variety of resolutions to allow for i18n and for multiplicity of naming. Geographic features…
This paper describes the systems of, and the experiments by, Microsoft Research Asia (MSRA), with the support of Microsoft Research (MSR), in the IWSLT 2010 evaluation campaign. We participated in all tracks of the DIALOG task (Chinese/English). While we follow the general training and decoding routine of statistical machine translation (SMT) and that of MT…
This paper describes the Microsoft Research (MSR) system for the evaluation campaign of the 2011 International Workshop on Spoken Language Translation. The evaluation task is to translate TED talks (www.ted.com). This task presents two unique challenges: First, the underlying topic switches sharply from talk to talk. Therefore, the translation system needs…
We broaden the application of data selection methods for domain adaptation to a larger number of languages, data, and decoders than shown in previous work, and explore comparable applications for both monolingual and bilingual cross-entropy difference methods. We compare domain adapted systems against very large general-purpose systems for the same…
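As a rough illustration of what a monolingual cross-entropy difference score looks like, the sketch below ranks general-domain sentences by H_in(s) - H_gen(s), where lower scores mean the sentence looks more like the in-domain data. It is not the system described in the abstract: the add-one smoothed unigram models, the use of the full general corpus to train the general model, and all function names are assumptions made for brevity.

```python
import math
from collections import Counter


def train_unigram(sentences, vocab):
    """Add-one smoothed unigram model over a fixed vocabulary (assumed toy LM)."""
    counts = Counter(
        tok if tok in vocab else "<unk>"
        for s in sentences
        for tok in s.split()
    )
    total = sum(counts.values()) + len(vocab)
    return {w: (counts[w] + 1) / total for w in vocab}


def cross_entropy(sentence, model):
    """Per-token cross-entropy of a sentence in bits; empty sentences score inf."""
    toks = sentence.split()
    if not toks:
        return float("inf")
    return -sum(math.log2(model.get(t, model["<unk>"])) for t in toks) / len(toks)


def rank_by_ce_difference(general, in_domain):
    """Sort general-domain sentences by H_in(s) - H_gen(s), most in-domain first."""
    vocab = {t for s in general + in_domain for t in s.split()} | {"<unk>"}
    lm_in = train_unigram(in_domain, vocab)
    lm_gen = train_unigram(general, vocab)
    scored = [
        (cross_entropy(s, lm_in) - cross_entropy(s, lm_gen), s) for s in general
    ]
    return [s for _, s in sorted(scored)]


in_domain = ["the patient needs a doctor", "the doctor saw the patient"]
general = [
    "the stock market fell today",
    "the patient saw a doctor",
    "football fans cheered loudly",
]
ranked = rank_by_ce_difference(general, in_domain)
```

In practice one would keep only the top-ranked portion of `ranked` as adaptation data; a bilingual variant would sum the same difference over the source and target sides of each sentence pair.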
We present a method that improves data selection by combining a hybrid word/part-of-speech representation for corpora, with the idea of distinguishing between rare and frequent events. We validate our approach using data selection for machine translation, and show that it maintains or improves BLEU and TER translation scores while substantially improving…
The IWSLT benchmark task is an annual evaluation campaign on spoken language translation held by the International Workshop on Spoken Language Translation (IWSLT). The task is to translate TED talks (www.ted.com). This task presents two unique challenges: Firstly, the underlying topic switches sharply from talk to talk, and each one contains only tens to…
This paper describes the University of Washington's system for the 2009 International Workshop on Spoken Language Translation (IWSLT) evaluation campaign. Two systems were developed, one each for the BTEC Chinese-to-English and Arabic-to-English tracks. We describe experiments with different preprocessing and alignment combination schemes. Our main focus…