Corpus ID: 6797247

Unsupervised comparable corpora preparation and exploration for bi-lingual translation equivalents

Krzysztof Wołk and Krzysztof Marasek
The multilingual nature of the world makes translation a crucial requirement today. Parallel dictionaries constructed by humans are a widely available resource, but they are limited and do not provide enough coverage for good-quality translation because of out-of-vocabulary words and neologisms. This motivates the use of statistical translation systems, which unfortunately depend on the quantity and quality of training data. Such systems have very limited availability, especially…

Enhancing the Assessment of (Polish) Translation in PROMIS Using Statistical, Semantic, and Neural Network Metrics

The result is a semi-automatic semantic evaluation metric for Polish based on the concept of the human-aided translation evaluation metric (HMEANT); the results show that the proposed metrics can help assess translations in PROMIS.

Learning to translate from graded and negative relevance information

A new learning objective is proposed based on structured ramp loss, which learns from graded relevance, explicitly including negative relevance information, for learning to translate by exploiting cross-lingual link structure in multilingual document collections.

Automatic bilingual corpus collection from Wikipedia

This study aims to combine technologies for domain classification, domain source identification, and comparable file alignment into a unified tool that will be used to assist with corpus collection for machine translation.

7th Symposium on Languages, Applications and Technologies, SLATE 2018, June 21-22, 2018, Guimaraes, Portugal

Kaang is an automatic generator of RESTFul Web applications that will help novice developers to decrease their learning curve while facing the new frameworks and libraries commonly found in the modern Web and speed up the work of expert developers avoiding all the repetitive and bureaucratic work.

Tuned and GPU-Accelerated Parallel Data Mining from Comparable Corpora

Improvements to Yalign's mining methodology are presented: the comparison algorithm is reimplemented, tuning scripts are introduced, and performance is improved using GPU acceleration.

Extracting Parallel Sentences from Comparable Corpora using Document Level Alignment

This work advances the state of the art in parallel sentence extraction by modeling the document level alignment, motivated by the observation that parallel sentence pairs are often found in close proximity.

A light way to collect comparable corpora from the Web

It is shown experimentally that titles can be used to approximate the comparison between documents using full document contents, and the amount of time and resources spent for tasks 1 and 2 is reduced.

Inversion Transduction Grammar Constraints for Mining Parallel Sentences from Quasi-Comparable Corpora

The method introduced exploits Bracketing ITGs to produce the first known results for this problem, and obtains large accuracy gains on this task compared to the expected performance of state-of-the-art models that were developed for the less stringent task of mining comparable sentence pairs.

Mining for Domain-specific Parallel Text from Wikipedia

This paper proposes a method for exploiting Wikipedia articles without worrying about the position of the sentences in the text by means of a customized metric, which combines different similarity criteria.

A Sentence Meaning Based Alignment Method for Parallel Text Corpora Preparation

A language-independent sentence alignment approach, based on Polish (a language that is not position-sensitive) to English experiments, is proposed, and an improvement in MT system score for text processed with the described tool is shown.

Polish-English speech statistical machine translation systems for the IWSLT 2014

Various elements of the TED parallel text corpora for the IWSLT 2013 evaluation campaign were used as the basis for training of language models, and for development, tuning and testing of the translation system.

Methods for Collection and Evaluation of Comparable Documents

This chapter describes work on developing methods for automatically gathering comparable corpora from the Web, specifically for under-resourced languages, and an evaluation method is developed to assess the quality of the retrieved documents.

Domain Adaptation via Pseudo In-Domain Data Selection

The results show that more training data is not always better, and that best results are attained via proper domain-relevant data selection, as well as combining in- and general-domain systems during decoding.

Method for Building Sentence-Aligned Corpus from Wikipedia

A framework is proposed for a machine translation (MT) bootstrapping method that uses multilingual Wikipedia articles to generate a statistical machine translation (SMT) system and a sentence-aligned corpus simultaneously.