• Corpus ID: 7559749

Extracting an English-Persian Parallel Corpus from Comparable Corpora

Akbar Karimi, Ebrahim Ansari, Bahram Sadeghi Bigham
Parallel data are an important part of a reliable Statistical Machine Translation (SMT) system. The more of these data are available, the better the quality of the SMT system. However, for some language pairs such as Persian-English, parallel sources of this kind are scarce. In this paper, a bidirectional method is proposed to extract parallel sentences from document-aligned English and Persian Wikipedia articles. Two machine translation systems are employed to translate from Persian to English and the… 

PC-Corpus: A Persian-Chinese Parallel Corpora

The creation of a bilingual Persian-Chinese corpus (PC-Corpus), the very first corpus for this language pair, is described; the corpus is significant for the future of machine translation on the Persian-Chinese language pair.

Parallel Data Extraction using Word Embeddings

This paper proposes to automatically extract parallel sentences from comparable corpora without using any MT system or even any parallel corpus at all, using crosslingual information retrieval (CLIR), average word embeddings, text similarity and a bilingual dictionary.

Effective Bitext Extraction From Comparable Corpora Using a Combination of Three Different Approaches

It is shown that the system is capable of extracting useful parallel sentences with high accuracy, and that the extracted pairs substantially increase translation quality of an MT system trained on the data, as measured by automatic evaluation metrics.

Recent works on Parallel Sentence Extraction from Comparable Corpora

Recent work on extracting parallel sentences from comparable corpora is compiled, covering word-embedding-based, machine-translation-based, and deep-learning-based approaches.

Extracting Parallel Sentences from Low-Resource Language Pairs with Minimal Supervision

A new method creates cross-domain mappings in a small number of single languages and constructs a classifier to extract bilingual parallel sentence pairs; its effectiveness is demonstrated on the low-resource Uygur-Chinese language pair using machine translation.

Obtaining Parallel Sentences in Low-Resource Language Pairs with Minimal Supervision

A novel methodology to obtain parallel sentences using only a small bilingual seed lexicon of several hundred entries, which can yield a large number of high-accuracy bilingual parallel sentences in low-resource language pairs.

ParsiNLU: A Suite of Language Understanding Challenges for Persian

This work introduces ParsiNLU, the first benchmark in the Persian language that includes a range of language understanding tasks (reading comprehension, textual entailment, and so on), presents the first results of state-of-the-art monolingual and multilingual pre-trained language models on this benchmark, and compares them with human performance.

LSCP: Enhanced Large Scale Colloquial Persian Language Understanding

This work proposes a “Large Scale Colloquial Persian Dataset” (LSCP), hierarchically organized in a semantic taxonomy, that treats multi-task informal Persian language understanding as a comprehensive problem.

Unsupervised Word Sense Disambiguation Using Word Embeddings

A novel unsupervised method is proposed to disambiguate words from the first language by deploying a trained word embeddings model of the second language using only a bilingual dictionary.

Morphological Networks for Persian and Turkish: What Can Be Induced from Morpheme Segmentation?

An algorithm is presented that induces morphological networks for Persian and Turkish using morpheme-segmented lexicons; the experimental results show that the accuracy of the initial segmented data influences derivational network quality.



Using English as Pivot to Extract Persian-Italian Parallel Sentences from Non-Parallel Corpora

Experimental results show that using the new pivot-based extraction can significantly increase the quality of the bilingual corpus and consequently improve the performance of the Persian-Italian SMT system.

Parallel sentence generation from comparable corpora for improved SMT

The approach was applied to Arabic–English and French–English systems using comparable news corpora and considerable improvements were achieved in the BLEU score.

Improving Machine Translation Performance by Exploiting Non-Parallel Corpora

A maximum entropy classifier is trained that, given a pair of sentences, can reliably determine whether or not they are translations of each other and can be applied with great benefit to language pairs for which only scarce resources are available.

Extracting Parallel Sentences from Comparable Corpora using Document Level Alignment

This work advances the state of the art in parallel sentence extraction by modeling the document level alignment, motivated by the observation that parallel sentence pairs are often found in close proximity.

A Generative Model for Extracting Parallel Fragments from Comparable Documents

A generative LDA-based model for extracting parallel fragments from comparable documents without any initial parallel data or bilingual lexicon is proposed, and experimental results show significant improvement if the extracted sentence fragments generated by the proposed method are used in addition to an existing parallel corpus in an SMT task.

Parallel-Wiki: A Collection of Parallel Sentences Extracted from Wikipedia

A collection of parallel sentences extracted from the entire Wikipedia collection of documents for the following pairs of languages: English-German, English-Romanian and English-Spanish is presented.

On the Use of Comparable Corpora to Improve SMT performance

A statistical machine translation system built from small amounts of parallel text is used to translate the source side of the non-parallel corpus, and the quality of the extracted data is evaluated by showing that it significantly improves the performance of an SMT system.

A fully unsupervised approach for mining parallel data from comparable corpora

The experiments show that the unsupervised method can be applied when parallel data are lacking, and it is also applied successfully to a low-resourced language pair (French-Vietnamese).

Extracting Parallel Sub-Sentential Fragments from Non-Parallel Corpora

A novel method for extracting parallel sub-sentential fragments from comparable, non-parallel bilingual corpora by analyzing potentially similar sentence pairs using a signal-processing-inspired approach, which enables it to extract useful machine translation training data even from very non-parallel corpora that contain no parallel sentence pairs.

Mining Very-Non-Parallel Corpora: Parallel Sentence and Lexicon Extraction via Bootstrapping and E

An iterative bootstrapping framework based on the principle of “find-one-get-more”, which claims that documents found to contain one pair of parallel sentences must contain others even if the documents are judged to be of low similarity, is presented.