Automatic bilingual corpus collection from Wikipedia
@inproceedings{Unitt2016AutomaticBC, title={Automatic bilingual corpus collection from Wikipedia}, author={Mark D. Unitt and Simon Tite and Pejman Saeghe}, booktitle={TC}, year={2016} }
This is a study to combine a number of existing technologies with newly developed tools to create an automatic tool to assist with corpus collection for machine translation. This study aims to combine technologies for domain classification, domain source identification, and comparable file alignment into a unified tool. The unified tool will be used to make the corpora collection process more focused and efficient and enable a wider variety of sources to be used.
References
Building Bilingual Parallel Corpora Based on Wikipedia
- Computer Science, 2010 Second International Conference on Computer Engineering and Applications
- 2010
Experimental results show that the proposed method of extracting sentence-level alignments with an extended link-based bilingual lexicon increases precision while reducing the total number of generated candidate pairs.
Unsupervised comparable corpora preparation and exploration for bi-lingual translation equivalents
- Computer Science, IWSLT
- 2015
Improvements to current comparable corpora mining methodologies are presented through re-implementation of the comparison algorithms (using the Needleman-Wunsch algorithm), introduction of a tuning script, and reduced computation time via GPU acceleration.
BootCaT: Bootstrapping Corpora and Terms from the Web
- Computer Science, LREC
- 2004
The BootCaT toolkit, a suite of Perl programs implementing an iterative procedure to bootstrap specialized corpora and terms from the web, is introduced, and the tools are evaluated by applying them to the construction of English and Italian corpora and term lists from the domain of psychiatry.