• Corpus ID: 44255330

Automatic bilingual corpus collection from Wikipedia

Mark D. Unitt, Simon Tite, Pejman Saeghe
This study combines a number of existing technologies with newly developed tools to create an automatic tool that assists with corpus collection for machine translation. It aims to bring together technologies for domain classification, domain source identification, and comparable file alignment in a single unified tool. The unified tool will be used to make the corpus collection process more focused and efficient, and to enable a wider variety of sources to be used.

Building Bilingual Parallel Corpora Based on Wikipedia

Experimental results show that the proposed method, which extracts sentence-level alignments using an extended link-based bilingual lexicon, increases precision while reducing the total number of generated candidate pairs.

Unsupervised comparable corpora preparation and exploration for bi-lingual translation equivalents

Improvements to current comparable corpora mining methodologies are presented: re-implementation of the comparison algorithms (using the Needleman-Wunsch algorithm), introduction of a tuning script, and reduced computation time through GPU acceleration.
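
For readers unfamiliar with the Needleman-Wunsch algorithm named above, the following is a minimal sketch of global sequence alignment by dynamic programming. It is not the cited paper's implementation: the function name `needleman_wunsch` and the scoring values (`match=1`, `mismatch=-1`, `gap=-1`) are assumptions chosen for illustration, and the summary does not say what units (characters, words, sentences) the authors align.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align sequences a and b.

    Returns (score, aligned_a, aligned_b), where gaps are marked with None.
    The scoring values are illustrative assumptions, not taken from the paper.
    """
    n, m = len(a), len(b)

    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def sub(i, j):
        # Substitution score for pairing a[i-1] with b[j-1]
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # pair the two items
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment
    aligned_a, aligned_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            aligned_a.append(a[i - 1])
            aligned_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            aligned_a.append(a[i - 1])
            aligned_b.append(None)
            i -= 1
        else:
            aligned_a.append(None)
            aligned_b.append(b[j - 1])
            j -= 1

    return score[n][m], aligned_a[::-1], aligned_b[::-1]
```

The function accepts any indexable sequences, so it can be called on strings (character-level alignment) or on lists of tokens. The table fill takes time proportional to `len(a) * len(b)`.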

BootCaT: Bootstrapping Corpora and Terms from the Web

The BootCaT toolkit, a suite of Perl programs implementing an iterative procedure to bootstrap specialized corpora and terms from the web, is introduced, and the tools are evaluated by applying them to the construction of English and Italian corpora and term lists from the domain of psychiatry.

Topic Modeling

  • Zoe Borovsky
  • Computer Science
    Encyclopedia of Machine Learning and Data Mining
  • 2017