Corpus ID: 18045031

Transliteration Generation and Mining with Limited Training Resources

Sittichai Jiampojamarn, Kenneth Dwyer, Shane Bergsma, Aditya Bhargava, Qing Dou, Mi-Young Kim, Grzegorz Kondrak

We present DirecTL+: an online discriminative sequence prediction model based on many-to-many alignments, which is further augmented by the incorporation of joint n-gram features. Experimental results show improvement over the results achieved by DirecTL in 2009. We also explore a number of diverse resource-free and language-independent approaches to transliteration mining, which range from simple to sophisticated. 

A Statistical Model for Unsupervised and Semi-supervised Transliteration Mining

A novel model is presented that is efficient and language-pair independent, and that mines transliteration pairs consistently in both unsupervised and semi-supervised settings, automatically extracting transliteration pairs from parallel corpora.

Robust Transliteration Mining from Comparable Corpora with Bilingual Topic Models

It is demonstrated that this method is able to extract a high-quality bilingual lexicon from a comparable corpus, and the topic model is extended to propose a solution to the out-of-domain problem.

Bootstrapping Transliteration with Constrained Discovery for Low-Resource Languages

This work presents a bootstrapping algorithm that uses constrained discovery to improve generation and can be used with as few as 500 training examples, which are shown to be obtainable from annotators in a matter of hours.

Leveraging supplementary transcriptions and transliterations via re-ranking

This thesis presents a unified method for leveraging related transliterations or transcription data to improve the performance of a base G2P or machine transliteration system.

Low-Resource G2P and P2G Conversion with Synthetic Training Data

A method for synthesizing training data using a combination of diverse models is proposed, and experiments are conducted with three transduction approaches in both standard and low-resource settings, as well as on the related task of phoneme-to-grapheme conversion.

Leveraging supplemental representations for sequential transduction

A unified reranking approach is applied to both grapheme-to-phoneme conversion and machine transliteration, demonstrating substantial accuracy improvements by utilizing heterogeneous transliterations and transcriptions of the input word.

A Comparative Study of Extremely Low-Resource Transliteration of the World's Languages

In this extremely low-resource task of transliterating around 1000 Bible names from 591 languages into English, it is found that a phrase-based MT system performs much better than other methods, including a g2p system and a neural MT system.

Comparison of Assorted Models for Transliteration

A combination of discriminative, generative, and neural models obtains the best results on the development sets of three neural MT models used in the NEWS 2018 Shared Task on Transliteration.

A Bayesian Alignment Approach to Transliteration Mining

In this article we present a technique for mining transliteration pairs using a set of simple features derived from a many-to-many bilingual forced-alignment at the grapheme level to classify

Leveraging Transliterations from Multiple Languages

This paper proposes a re-ranking method with features based on n-gram alignments as well as system and alignment scores that achieves a relative improvement of over 10% compared with both the base system used on its own and system combination.



DirecTL: a Language Independent Approach to Transliteration

DirecTL is an online discriminative sequence prediction model that employs a many-to-many alignment between target and source and is able to independently discover many of the language-specific regularities in the training data.

Transliteration as Constrained Optimization

It is shown that the transliteration problem can be formulated as a constrained optimization problem and can thus take into account contextual dependencies and constraints among character bi-grams in the two strings.

Weakly Supervised Named Entity Transliteration and Discovery from Multilingual Comparable Corpora

An (almost) unsupervised learning algorithm is presented for the automatic discovery of Named Entities (NEs) in a resource-free language, given a bilingual corpus in which it is weakly temporally aligned with a resource-rich language.

Joint Processing and Discriminative Training for Letter-to-Phoneme Conversion

The key idea is online discriminative training, which updates parameters according to a comparison of the current system output to the desired output, allowing the model to train all of its components together.

Applying Many-to-Many Alignments and Hidden Markov Models to Letter-to-Phoneme Conversion

This work presents a novel technique of training with many-to-many alignments of letters and phonemes, and applies an HMM method in conjunction with a local classification model to predict a global phoneme sequence given a word.

Alignment-Based Discriminative String Similarity

This work proposes an alignment-based discriminative framework for string similarity that achieves exceptional performance; on nine separate cognate identification experiments using six language pairs, it more than doubles the precision of traditional orthographic measures like Longest Common Subsequence Ratio and Dice’s Coefficient.
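
For readers unfamiliar with the two baseline measures named in this summary, the following is a minimal sketch of how they are commonly defined. It is not taken from the paper. It assumes LCSR is the longest common subsequence length divided by the length of the longer string, and that Dice's coefficient is computed over character bigrams. The function names are hypothetical.

```python
from collections import Counter


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcsr(a: str, b: str) -> float:
    """Longest Common Subsequence Ratio (assumed: LCS / longer length)."""
    if not a and not b:
        return 1.0  # convention chosen here for two empty strings
    return lcs_length(a, b) / max(len(a), len(b))


def bigrams(s: str) -> list:
    """Overlapping character bigrams of a string."""
    return [s[i:i + 2] for i in range(len(s) - 1)]


def dice(a: str, b: str) -> float:
    """Dice's coefficient over character bigrams (assumed variant)."""
    x, y = Counter(bigrams(a)), Counter(bigrams(b))
    total = sum(x.values()) + sum(y.values())
    if total == 0:
        return 1.0 if a == b else 0.0  # strings too short to have bigrams
    shared = sum((x & y).values())  # multiset intersection
    return 2 * shared / total


if __name__ == "__main__":
    print(lcsr("colour", "color"))
    print(dice("night", "nacht"))
```

Both measures depend only on surface character overlap, which is the limitation a learned, alignment-based similarity model is meant to address.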

A Conditional Random Field for Discriminatively-trained Finite-state String Edit Distance

This paper presents discriminative string-edit CRFs, a finite-state conditional random field model for edit sequences between strings, trained on both positive and negative instances of string pairs.

Models of translation equivalence among words

This article presents methods for biasing statistical translation models to reflect bitext properties, and shows how a statistical translation model can take advantage of preexisting knowledge that might be available about particular language pairs.

Learning String-Edit Distance

The stochastic model allows us to learn a string-edit distance function from a corpus of examples and is applicable to any string classification problem that may be solved using a similarity function against a database of labeled prototypes.

Adaptive duplicate detection using learnable string similarity measures

This paper proposes to employ learnable text distance functions for each database field, and shows that such measures are capable of adapting to the specific notion of similarity that is appropriate for the field's domain.