Nicola Bertoldi

We describe an open-source toolkit for statistical machine translation whose novel contributions are (a) support for linguistically motivated factors, (b) confusion network decoding, and (c) efficient data formats for translation models and language models. In addition to the SMT decoder, the toolkit also includes a wide variety of tools for training, …
The 2006 Language Engineering Workshop Open Source Toolkit for Statistical Machine Translation aimed to advance the state of the art in statistical machine translation through richer input and richer annotation of the training data. The workshop focused on three topics: factored translation models, confusion network decoding, and the …
Research in speech recognition and machine translation is boosting the use of large-scale n-gram language models. We present an open-source toolkit that makes it possible to handle language models with billions of n-grams efficiently on conventional machines. The IRSTLM toolkit supports distribution of n-gram collection and smoothing over a computer cluster, language …
Domain adaptation has recently gained interest in statistical machine translation to cope with the performance drop observed when testing conditions deviate from training conditions. The basic idea is that in-domain training data can be exploited to adapt all components of an already developed system. Previous work showed small performance gains by adapting …
This paper presents a novel statistical model for cross-language information retrieval. Given a written query in the source language, documents in the target language are ranked by integrating probabilities computed by two statistical models: a query-translation model, which generates the most probable term-by-term translations of the query, and a …
This paper describes advances in the use of confusion networks as an interface between automatic speech recognition and machine translation. In particular, it presents an implementation of a confusion network decoder that significantly improves on previous work in this direction in both efficiency and performance. The confusion network decoder results as an …
Translation with pivot languages has recently gained attention as a means to circumvent the data bottleneck of statistical machine translation (SMT). This paper tries to give a mathematically sound formulation of the various approaches presented in the literature and introduces new methods for training alignment models through pivot languages. We present …
The integration of machine translation in the human translation workflow raises intriguing and challenging research issues. One of them, addressed in this work, is how to dynamically adapt phrase-based statistical MT from user post-editing. By casting the problem in the online machine learning paradigm, we propose a cache-based adaptation technique method …
We describe an open-source implementation of minimum error rate training (MERT) for statistical machine translation (SMT). This was implemented within the Moses toolkit, although it is essentially standalone, with the aim of replacing the existing implementation with a cleaner, more flexible design, in order to facilitate further research in weight …
This paper describes advances in the use of confusion networks as an interface between automatic speech recognition and machine translation. In particular, it presents a decoding algorithm for confusion networks that is obtained as an extension of a state-of-the-art phrase-based text translation decoder. The confusion network decoder significantly improves both …