The influence of example-data homogeneity on EBMT quality

Abstract

The homogeneity of large corpora remains a largely unclear notion. In this study we first link the notions of similarity and homogeneity: a large corpus consists of sets of documents to which a similarity score, defined by cross-entropic measures, may be assigned, this similarity being implicitly expressed in the data. The distribution of the similarity scores of such subcorpora may then be interpreted as a representation of the homogeneity of the main corpus. The quality of an example-based machine translation (EBMT) system clearly depends heavily on the training examples it is fed. Being able to tune an MT system to a specific application through a careful selection of training data is therefore a critical issue. From this viewpoint, such a representation of homogeneity may be used to perform corpus adaptation, tuning an EBMT system to the particular domain, or sublanguage, of an expected task. In the following study we further describe this framework and compare it with existing methods based on computing linguistic feature frequencies.
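The abstract does not spell out the cross-entropic measure itself, so the sketch below is only a rough illustration of the idea: each document of a toy corpus is scored by its cross-entropy (bits per token) under an add-one-smoothed unigram model trained on the whole corpus, and the spread of those scores gives one possible picture of the score distribution the abstract refers to. The unigram model, the smoothing scheme, and the toy data are assumptions for illustration, not the paper's actual method.

import math
from collections import Counter

def unigram_model(tokens):
    """Add-one-smoothed unigram probabilities estimated from a token list (illustrative only)."""
    counts = Counter(tokens)
    vocab = set(tokens)
    total = len(tokens) + len(vocab) + 1  # +1 reserves probability mass for unseen tokens
    def prob(w):
        return (counts.get(w, 0) + 1) / total
    return prob

def cross_entropy(prob, tokens):
    """Average negative log2-probability (bits per token) of a document under the model."""
    return -sum(math.log2(prob(w)) for w in tokens) / len(tokens)

# Toy corpus: score every document against a model of the whole corpus.
# The distribution of these scores is one way to picture corpus homogeneity.
corpus_docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the rug".split(),
    "stock prices fell sharply in early trading".split(),
]
model = unigram_model([w for doc in corpus_docs for w in doc])
scores = [cross_entropy(model, doc) for doc in corpus_docs]
print(scores)  # lower score = closer to the corpus as a whole

A narrow spread of scores would indicate a relatively homogeneous corpus, while outlying documents (here, the third one) would be candidates to keep or discard when adapting training data to a target domain.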

Cite this paper

@inproceedings{Denoual2005TheIO,
  title  = {The influence of example-data homogeneity on EBMT quality},
  author = {Etienne Denoual},
  year   = {2005}
}