• Corpus ID: 586636

A Statistical Model for Lost Language Decipherment

  • Benjamin Snyder, Regina Barzilay, Kevin Knight
  • Annual Meeting of the Association for Computational Linguistics
In this paper we propose a method for the automatic decipherment of lost languages. Given a non-parallel corpus in a known related language, our model produces both alphabetic mappings and translations of words into their corresponding cognates. We employ a non-parametric Bayesian framework to simultaneously capture both low-level character mappings and high-level morphemic correspondences. This formulation enables us to encode some of the linguistic intuitions that have guided human… 

Neural Decipherment via Minimum-Cost Flow: From Ugaritic to Linear B

A novel neural approach for automatic decipherment of lost languages with first automatic results in deciphering Linear B, a syllabic language related to ancient Greek, where the model correctly translates 67.3% of cognates.

Feature-Based Decipherment for Machine Translation

The results show that the proposed log-linear model with contrastive divergence outperforms existing generative decipherment models by exploiting orthographic features; it both scales to large vocabularies and preserves accuracy in low- and no-resource contexts.

Deciphering Undersegmented Ancient Scripts Using Phonetic Prior

A decipherment model is proposed that handles both of these challenges by building on rich linguistic constraints reflecting consistent patterns in historical sound change, along with a measure of language closeness that correctly identifies related languages for Gothic and Ugaritic.

A Vowel Harmony Testing Algorithm to Aid in Ancient Script Decipherment

  • P. Revesz
  • Linguistics, Computer Science
    2020 24th International Conference on Circuits, Systems, Communications and Computers (CSCC)
  • 2020
An algorithm is developed that can test an important feature of the underlying language, namely the presence of vowel harmony in root words, in the Minoan language, thereby greatly narrowing its possible set of cognate languages.

Name deciphering in unrelated languages: The case study of Farsi and English

The proposed generative non-parametric model, a customized version of the model in [3] for name extraction, achieves results competitive with a supervised model.

Unsupervised Consonant-Vowel Prediction over Hundreds of Languages

This work performs posterior inference over hundreds of languages, leveraging knowledge of known languages and alphabets to uncover general linguistic patterns of typologically coherent language clusters and achieves average accuracy in the unsupervised consonant/vowel prediction task.

Deciphering Foreign Language by Combining Language Models and Context Vectors

A modification of the method of Ravi and Knight (2011) that scales to vocabulary sizes of several thousand words is presented; it obtains better results with only 5% of the computational effort when run with an n-gram language model.

A non-parametric model for the discovery of inflectional paradigms from plain text using graphical models over strings

The main goal of this thesis is to build probability models over inflectional paradigms, and thereby to sort the large vocabulary of a morphologically rich language into structured clusters, which can be learned with minimal supervision for any language that has inflectional morphology.

Unsupervised multilingual learning

A class of probabilistic models that exploit deep links among human languages as a form of naturally occurring supervision substantially improves performance on core text processing tasks, such as morphological segmentation, part-of-speech tagging, and syntactic parsing.



An Algorithm for Identifying Cognates in Bilingual Wordlists and its Applicability to Machine Translation

  • J. Guy
  • Linguistics
    J. Quant. Linguistics
  • 1994
The theoretical model and the practical algorithm behind the program COGNATE, which has been available since December 1991 in the pc/linguistics subdirectory of the anonymous ftp site garbo.uwasa.fi, are discussed.

Cross-lingual Propagation for Morphological Analysis

The proposed non-parametric Bayesian model effectively combines cross-lingual alignment with target language predictions, and is a potent alternative to projection methods which decompose these decisions into two separate stages of morphological segmentation.

A Probabilistic Approach to Diachronic Phonology

A probabilistic model of diachronic phonology is presented in which individual word forms undergo stochastic edits along the branches of a phylogenetic tree, together with results validating the model.

Unsupervised Analysis for Decipherment Problems

Techniques for understanding errors and significantly increasing performance in natural language decipherment problems using unsupervised learning are described.

Unsupervised models for morpheme segmentation and morphology learning

Morfessor can handle highly inflecting and compounding languages where words can consist of lengthy sequences of morphemes and is shown to perform very well compared to a widely known benchmark algorithm on Finnish data.

Identification of Cognates and Recurrent Sound Correspondences in Word Lists

The results of evaluation experiments involving the Indo-European, Algonquian, and Totonac families indicate that the methods are more accurate than comparable programs, and achieve high precision and recall on various test sets.

Identifying Cognates by Phonetic and Semantic Similarity

Tests performed on vocabularies of four Algonquian languages indicate that the method is capable of discovering on average nearly 75% of cognates at 50% precision.

The Reconstruction Engine: A Computer Implementation of the Comparative Method

The implementation of a computer program, the Reconstruction Engine (RE), which models the comparative method for establishing genetic affiliation among a group of languages, and features of RE that make it possible to handle the complex and sometimes imprecise representations of lexical items are discussed.

Writing Systems, Transliteration and Decipherment

The problem of decipherment and how computational methods might be brought to bear on the problem of unlocking the mysteries of as yet undeciphered ancient scripts are discussed, and techniques that have been used in speech recognition and machine translation might be applied.

Automatic Identification of Word Translations from Unrelated English and German Corpora

The current study, based on the assumption that there is a correlation between the patterns of word co-occurrences in corpora of different languages, achieves a significant improvement, with about 72% of word translations identified correctly.