Improving Statistical MT through Morphological Analysis

Sharon Goldwater and David McClosky
In statistical machine translation, estimating word-to-word alignment probabilities for the translation model can be difficult due to the problem of sparse data: most words in a given corpus occur at most a handful of times. With a highly inflected language such as Czech, this problem can be particularly severe. In addition, much of the morphological variation seen in Czech words is not reflected in either the morphology or syntax of a language like English. In this work, we show that using… 


Morphological Analysis for Phrase-Based Statistical Machine Translation
This work presents a language-independent approach that captures morphological information from parallel corpora and successfully integrates it into the normal machine translation system, and suggests a preliminary model for self-correction of the translation output using knowledge from both the source and target languages.
Morphology-aware statistical machine translation based on morphs induced in an unsupervised manner
The proposed morph-based solution has clear benefits, as morphologically well-motivated structures (phrases) are learned, and the proportion of words left untranslated is clearly reduced.
Morphology-aware alignments for translation to and from a synthetic language
A factored alignment model specifically designed to handle alignments involving a synthetic language (using the case of the Czech:English language pair) is proposed and it is shown that this model can greatly reduce the number of non-aligned words on the English side, yielding more compact translation models.
Statistical Machine Translation (SMT) for Highly-Inflectional Scarce-Resource Language
By using the best one-to-one alignment of the En-like scheme, translation quality from Persian to English improves by about 3 BLEU points over phrase-based SMT.
Statistical Machine Translation into a Morphologically Complex Language
The results of the investigation into phrase-based statistical machine translation from English into Turkish - an agglutinative language with very productive inflectional and derivational word-formation processes are presented and the applicability of BLEU to morphologically complex languages like Turkish is discussed.
This thesis incorporates previously proposed unsupervised morphological segmentation methods into the translation model and combines this segmentation-based system with a Conditional Random Field morphology prediction model, finding that the morphology-aware models yield significantly more fluent translation output than a baseline word-based model.
Improving phrase-based statistical machine translation with morphosyntactic transformation
A phrase-based statistical machine translation approach which uses linguistic analysis in the preprocessing phase and a transformational model based on a probabilistic context-free grammar for syntactic transformation is presented.
Enriching input in Statistical Machine Translation
Manual error analysis shows that translation of the annotated words (nouns and verbs) improves, although the annotation causes a sparse-data problem; human evaluation shows that a model combining both noun cases and verb persons increases the adequacy but reduces the fluency of the generated translation.
Knowledge Expansion of a Statistical Machine Translation System using Morphological Resources
This paper addresses the problem of expanding the knowledge of an SMT system without adding parallel data, but extending the knowledge produced during the training phase by inserting artificial entries in the phrase and reordering models using external morphological resources.
Exploiting Morphology and Local Word Reordering in English-to-Turkish Phrase-Based Statistical Machine Translation
This paper presents a scheme for repairing the decoder output by correcting words which have incorrect morphological structure or which are out-of-vocabulary with respect to the training data and language model, to further improve the translations.


Improving SMT quality with morpho-syntactic analysis
It is argued that training data is typically not large enough to sufficiently represent the range of different phenomena in natural languages and that SMT can take advantage of the explicit introduction of some knowledge about the languages under consideration.
Morphological Analysis for Statistical Machine Translation
We present a novel morphological analysis technique which induces a morphological and syntactic symmetry between two languages with highly asymmetrical morphological structures to improve statistical
Statistical Machine Translation with Scarce Resources Using Morpho-syntactic Information
The construction of hierarchical lexicon models on the basis of equivalence classes of words is proposed, and sentence-level restructuring transformations that aim at the assimilation of word order in related sentences are introduced.
Czech-English dependency-based machine translation
In the evaluation part, this work compares results of the fully automated and the manually annotated processes of building the tectogrammatical representation.
Prague Czech-English Dependency Treebank. Syntactically Annotated Resources for Machine Translation
The Prague Czech-English Dependency Treebank (PCEDT) is introduced, a new Czech-English parallel resource suitable for experiments in structural machine translation and a bilingual syntactically annotated corpus and translation dictionaries.
A Systematic Comparison of Various Statistical Alignment Models
An important result is that refined alignment models with a first-order dependence and a fertility model yield significantly better results than simple heuristic models.
Statistical language modeling using the CMU-Cambridge toolkit
The CMU Statistical Language Modeling toolkit was released in order to facilitate the construction and testing of bigram and trigram language models, and the technology as implemented in the toolkit is outlined.
Statistical Machine Translation: Final Report
A basic statistical MT toolkit is constructed and an MT system for a new language pair, Chinese-English, is built in a single day, and new follow-on ideas have developed sporadically.
Statistical machine translation: The fabulous present and future
  • Invited talk at the Workshop on Building and Using Parallel Texts at ACL’05.
  • 2005
Building a Syntactically Annotated Corpus: The Prague Dependency Treebank
  • Eva Hajičová, editor, Issues of Valency and Meaning. Studies in Honor of Jarmila Panevová, pages 12–19. Prague Karolinum, Charles Univer-
  • 1998