Guillaume Wisniewski

Learn More
Naturally-occurring instances of linguistic phenomena are important both for training and for evaluating automatic text processing. When available in large quantities, they also prove interesting material for linguistic studies. In this article, we present WiCoPaCo (Wikipedia Correction and Paraphrase Corpus), a new freely-available resource built by(More)
Using multi-layer neural networks to estimate the probabilities of word sequences is a promising research area in statistical language modeling, with applications in speech recognition and statistical machine translation. However, training such models for large vocabulary tasks is computationally challenging which does not scale easily to the huge corpora(More)
Extant Statistical Machine Translation (SMT) systems are very complex softwares, which embed multiple layers of heuristics and embark very large numbers of numerical parameters. As a result, it is difficult to analyze output translations and there is a real need for tools that could help developers to better understand the various causes of errors. In this(More)
This paper describes LIMSI's submissions to the shared WMT'16 task " Translation of News ". We report results for Romanian-English in both directions, for English to Russian, as well as preliminary experiments on reordering to translate from English into German. Our submissions use mainly NCODE and MOSES along with continuous space models in a(More)
We present a novel translation quality informed procedure for both extraction and scoring of phrase pairs in PBSMT systems. We reformulate the extraction problem in the supervised learning framework in an attempt to take into account the translation quality, while incorporating arbitrary features in order to circumvent alignment errors. One-Class SVMs and(More)
When Part-of-Speech annotated data is scarce, e.g. for under-resourced languages , one can turn to cross-lingual transfer and crawled dictionaries to collect partially supervised data. We cast this problem in the framework of ambiguous learning and show how to learn an accurate history-based model. Experiments on ten languages show significant improvements(More)
In Machine Translation, it is customary to compute the model score of a predicted hypothesis as a linear combination of multiple features, where each feature assesses a particular facet of the hypothesis. The choice of a linear combination is usually justified by the possibility of efficient inference (decoding); yet, the appropriateness of this simple(More)
The search space of Phrase-Based Statistical Machine Translation (PBSMT) systems can be represented under the form of a directed acyclic graph (lattice). The quality of this search space can thus be evaluated by computing the best achievable hypothesis in the lattice, the so-called oracle hypothesis. For common SMT metrics, this problem is however NP-hard(More)
The dissemination of statistical machine translation (SMT) systems in the professional translation industry is still limited by the lack of reliability of SMT outputs, the quality of which varies to a great extent. A critical piece of information would be for MT systems to automatically assess their output translations with automatically derived quality(More)
We integrate semantic information at two stages of the translation process of a state-of-the-art SMT system. A Word Sense Disam-biguation (WSD) classifier produces a probability distribution over the translation candidates of source words which is exploited in two ways. First, the probabilities serve to rerank a list of n-best translations produced by the(More)