Learn More
We explore efficient domain adaptation for the task of statistical machine translation based on extracting sentences from a large general-domain parallel corpus that are most relevant to the target domain. These sentences may be selected with simple cross-entropy based methods, of which we present three. As these sentences are not themselves identical to(More)
Our participation in the IWSLT 2005 speech translation task is our first effort to work on limited domain speech data. We adapted our statistical machine translation system that performed successfully in previous DARPA competitions on open domain text translations. We participated in the supplied corpora transcription track. We achieved the highest BLEU(More)
This document describes the first NIST MT Evaluation submission of the newly formed Edinburgh University Statistical Machine Translation Group. Our entry to the 2005 DARPA/NIST MT Evaluation was largely based on the 2004 MIT system. In a two month effort we fo-cused on adding more data and a few new features to our Arabic-English system. We also worked on(More)
OBJECTIVE Accurate, understandable public health information is important for ensuring the health of the nation. The large portion of the US population with Limited English Proficiency is best served by translations of public-health information into other languages. However, a large number of health departments and primary care clinics face significant(More)
We present a method that improves data selection by combining a hybrid word/part-of-speech representation for corpora, with the idea of distinguishing between rare and frequent events. We validate our approach using data selection for machine translation, and show that it maintains or improves BLEU and TER translation scores while substantially improving(More)
The authors measured the rate and determinants of posttraumatic stress disorder (PTSD) in a group of cancer survivors. Patients who had a history of cancer diagnosis with at least 3 years since diagnosis, receiving no active treatment, such as chemotherapy or radiation, were interviewed (N = 27). Patients, who were part of the DSM-IV PTSD field trial, were(More)
Machine translation systems, as a whole, are currently not able to use the output of linguistic tools, such as part-of-speech taggers, to effectively improve translation performance. However, a new language modeling technique, Factored Language Models can incorporate the additional linguistic information that is produced by these tools. In the field of(More)
We define a data model for storing geographic information from multiple sources that enables the efficient production of customizable gazetteers. The GazDB separates names from features while storing the relationships between them. Geographic names are stored in a variety of resolutions to allow for i18n and for multiplicity of naming. Geographic features(More)
This paper describes the Microsoft Research (MSR) system for the evaluation campaign of the 2011 international workshop on spoken language translation. The evaluation task is to translate TED talks (www.ted.com). This task presents two unique challenges: First, the underlying topic switches sharply from talk to talk. Therefore, the translation system needs(More)
This paper describes the systems of, and the experiments by, Microsoft Research Asia (MSRA), with the support of Microsoft Research (MSR), in the IWSLT 2010 evaluation campaign. We participated in all tracks of the DIALOG task (Chinese/English). While we follow the general training and decoding routine of statistical machine translation (SMT) and that of MT(More)