MT Detection in Web-Scraped Parallel Corpora

  • Spencer Rarrick
  • Published 2011
The Web is an invaluable source of parallel data, but in recent years it has become polluted with increasing amounts of machine-translated content. Using such data to train an MT system can introduce errors and lower the quality of the resulting system. In this paper, we present an algorithm for filtering machine-translated content from Web-scraped parallel corpora, and discuss its application in cleaning such corpora for use in training statistical machine translation systems. We demonstrate…


