Corpus ID: 44255330

Automatic Bilingual Corpus Collection from Wikipedia

  • Mark Unitt, Simon Tite, Pejman Saeghe
  • Published 2016
  • This study combines a number of existing technologies with newly developed tools to create an automatic tool that assists with corpus collection for machine translation. It aims to combine technologies for domain classification, domain source identification, and comparable file alignment into a unified tool. The unified tool will be used to make the corpus collection process more focused and efficient, and to enable a wider variety of sources to be used.


