Corpus ID: 586636

A Statistical Model for Lost Language Decipherment

@inproceedings{Snyder2010ASM,
  title={A Statistical Model for Lost Language Decipherment},
  author={Benjamin Snyder and R. Barzilay and Kevin Knight},
  booktitle={ACL},
  year={2010}
}
In this paper we propose a method for the automatic decipherment of lost languages. Given a non-parallel corpus in a known related language, our model produces both alphabetic mappings and translations of words into their corresponding cognates. We employ a non-parametric Bayesian framework to simultaneously capture both low-level character mappings and high-level morphemic correspondences. This formulation enables us to encode some of the linguistic intuitions that have guided human…
Neural Decipherment via Minimum-Cost Flow: From Ugaritic to Linear B
A novel neural approach to the automatic decipherment of lost languages, with the first automatic results in deciphering Linear B, a syllabic language related to ancient Greek, where the model correctly translates 67.3% of cognates.
Feature-Based Decipherment for Machine Translation
The results show that the proposed log-linear model with contrastive divergence outperforms existing generative decipherment models by exploiting orthographic features, and that it both scales to large vocabularies and preserves accuracy in low- and no-resource contexts.
Orthographic similarities across languages provide a strong signal for unsupervised probabilistic transduction (decipherment) for closely related language pairs. The existing decipherment models, …
Deciphering Undersegmented Ancient Scripts Using Phonetic Prior
Proposes a decipherment model that handles both of these challenges by building on rich linguistic constraints reflecting consistent patterns in historical sound change, together with a measure of language closeness that correctly identifies related languages for Gothic and Ugaritic.
A Vowel Harmony Testing Algorithm to Aid in Ancient Script Decipherment
  • P. Revesz
  • 2020 24th International Conference on Circuits, Systems, Communications and Computers (CSCC)
  • 2020
Previous algorithms for deciphering lost languages assumed knowledge of cognate languages. Since that is not always possible a priori, this paper develops an algorithm that can test an important…
Name deciphering in unrelated languages: The case study of Farsi and English
The proposed generative non-parametric model, a customized version of the model in [3] for name extraction, achieves results competitive with a supervised model.
Unsupervised Consonant-Vowel Prediction over Hundreds of Languages
This work performs posterior inference over hundreds of languages, leveraging knowledge of known languages and alphabets to uncover general linguistic patterns of typologically coherent language clusters and achieves average accuracy in the unsupervised consonant/vowel prediction task.
Deciphering Foreign Language by Combining Language Models and Context Vectors
Presents a modification of the method of Ravi and Knight (2011) that scales to vocabulary sizes of several thousand words and gives better results with only 5% of the computational effort when run with an n-gram language model.
A non-parametric model for the discovery of inflectional paradigms from plain text using graphical models over strings
The main goal of this thesis is to build probability models over inflectional paradigms, and thereby to sort the large vocabulary of a morphologically rich language into structured clusters, which can be learned with minimal supervision for any language that has inflectional morphology.
Unsupervised multilingual learning
A class of probabilistic models that exploit deep links among human languages as a form of naturally occurring supervision allows us to substantially improve performance on core text processing tasks, such as morphological segmentation, part-of-speech tagging, and syntactic parsing.

References

An Algorithm for Identifying Cognates in Bilingual Wordlists and its Applicability to Machine Translation
  • J. Guy
  • Computer Science
  • J. Quant. Linguistics
  • 1994
The theoretical model and the practical algorithm behind the program COGNATE, which has been available since December 1991 in the pc/linguistics subdirectory of the anonymous ftp site garbo.uwasa.fi, are discussed.
Cross-lingual Propagation for Morphological Analysis
The proposed non-parametric Bayesian model effectively combines cross-lingual alignment with target language predictions, and is a potent alternative to projection methods which decompose these decisions into two separate stages of morphological segmentation.
A Probabilistic Approach to Diachronic Phonology
A probabilistic model of diachronic phonology, in which individual word forms undergo stochastic edits along the branches of a phylogenetic tree, is presented together with results validating the model.
Unsupervised Analysis for Decipherment Problems
Techniques for understanding errors and significantly increasing performance in natural language decipherment problems using unsupervised learning are described.
Unsupervised models for morpheme segmentation and morphology learning
Morfessor can handle highly inflecting and compounding languages where words can consist of lengthy sequences of morphemes and is shown to perform very well compared to a widely known benchmark algorithm on Finnish data.
Identification of Cognates and Recurrent Sound Correspondences in Word Lists
The results of evaluation experiments involving the Indo-European, Algonquian, and Totonac families indicate that the methods are more accurate than comparable programs, and achieve high precision and recall on various test sets.
A Computational Approach to Deciphering Unknown Scripts
This work proposes and evaluates computational techniques for deciphering unknown scripts and considers which scripts are easy or hard to decipher, how much data is required, and whether the techniques are robust against language change over time.
Identifying Cognates by Phonetic and Semantic Similarity
Tests performed on vocabularies of four Algonquian languages indicate that the method is capable of discovering on average nearly 75% of cognates at 50% precision.
The Reconstruction Engine: A Computer Implementation of the Comparative Method
The implementation of a computer program, the Reconstruction Engine (RE), which models the comparative method for establishing genetic affiliation among a group of languages, and features of RE that make it possible to handle the complex and sometimes imprecise representations of lexical items are discussed.
Lost Languages: The Enigma of the World's Undeciphered Scripts
Though much has been learned about the languages of lost cultures such as Ancient Egypt and the Mayans, there remain many scripts that have resisted modern efforts to decipher them. Lost Languages…