• Corpus ID: 212737162

HELFI: a Hebrew-Greek-Finnish Parallel Bible Corpus with Cross-Lingual Morpheme Alignment

Anssi Yli-Jyrä, Josi Purhonen, Matti Liljeqvist, Arto Antturi, Pekka Nieminen, Kari M. Räntilä, Valtter Luoto
Twenty-five years ago, morphologically aligned Hebrew-Finnish and Greek-Finnish bitexts (texts accompanied by a translation) were constructed manually in order to create an analytical concordance (Luoto et al., eds. 1997) for a Finnish Bible translation. The creators of the bitexts recently secured the publisher’s permission to release its fine-grained alignment, but the alignment was still dependent on proprietary, third-party resources such as a copyrighted text edition and proprietary… 


Automated creation of parallel Bible corpora with cross-lingual semantic concordance
A novel approach for the automated creation of parallel New Testament corpora with cross-lingual semantic concordance based on Strong’s numbers is presented, along with a dictionary-based approach and a Conditional Random Field (CRF) model.
Graph Algorithms for Multiparallel Word Alignment
This work exploits the multiparallelity of corpora by representing an initial set of bilingual alignments as a graph and then predicting additional edges in the graph and presents two graph algorithms for edge prediction: one inspired by recommender systems and one based on network link prediction.
Graph Neural Networks for Multiparallel Word Alignment
This work computes high-quality word alignments between multiple language pairs by considering all language pairs together, using graph neural networks to exploit the graph structure, and shows that community detection algorithms can provide valuable information for multiparallel word alignment.


Creating a Parallel Corpus from the "Book of 2000 Tongues"
A project to annotate biblical texts in order to create an aligned multilingual Bible corpus for linguistic research, particularly computational linguistics, including automatically creating and evaluating translation lexicons and semantically tagged texts.
There are a few aspects of linguistic work which are susceptible to standardization. They concern mainly notational matters; the International Phonetic Alphabet is an example, or the transliteration
Morphological inference from bitext for resource-poor languages
Combined, these methods for collecting and analyzing bitext data offer a pathway for the automatic creation of richly-annotated corpora for resource-poor languages, requiring minimal amounts of data and minimal manual analysis.
English-Urdu Religious Parallel Corpus
The English-Urdu parallel corpus is a collection of religious texts in English and Urdu with sentence alignments, used for experiments with statistical machine translation.
Deriving Consensus for Multi-Parallel Corpora: an English Bible Study
A method is presented which generates a single corpus-wide multiway alignment: a consensus between 27 versions of the English Bible. The method is language-independent and applicable to any multi-parallel corpus.
Evaluating prose style transfer with the Bible
This work identifies a high-quality source of aligned, stylistically distinct text in different versions of the Bible, and provides a standardized split, into training, development and testing data, of the public domain versions in their corpus.
Biblia Hebraica Stuttgartensia
This is the definitive edition of the "Hebrew Bible". It is the original-language edition most widely used by scholars. It is a revision of the third edition of the "Biblia Hebraica" the first Bible
Manual Annotation of Translational Equivalence: The Blinker Project
The annotated texts, the specially designed annotation tool, and the strategies employed to increase the consistency of the annotations are described; the results indicate that the annotations are reasonably reliable and that the method is easy to replicate.
Annotation Guidelines for Chinese-Korean Word Alignment
Annotation guidelines for Chinese-Korean word alignment are presented through contrastive analysis of morpho-syntactic encodings and the instruction methods exemplified are applicable in developing systematic and comprehensible alignment guidelines for other languages having such different linguistic phenomena.
Bitext Alignment
  • J. Tiedemann
  • Computer Science
    Synthesis Lectures on Human Language Technologies
  • 2011
The essential tasks that have to be carried out when building parallel corpora, from the collection of translated documents up to sub-sentential alignments, are covered, including various approaches to document alignment, sentence alignment, word alignment and tree structure alignment.