Preprint: Non-linear Mapping for Improved Identification of 1300+ Languages

Non-linear mappings of the form $P(\text{ngram})^{\gamma}$ and $\frac{\log(1+\tau P(\text{ngram}))}{\log(1+\tau)}$ are applied to the n-gram probabilities in five trainable open-source language identifiers. The first mapping reduces classification errors by 4.0% to 83.9% over a test set of more than one million 65-character strings in 1366 languages, and by 2.6% to 76.7% over a subset of 781…
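To make the two mappings concrete, here is a minimal Python sketch that applies each formula to a single n-gram probability. The function names `gamma_map` and `log_map` and the parameter values in the usage lines are hypothetical, chosen only for illustration. This excerpt does not say how the identifiers consume the mapped values or which values of γ and τ were used.

```python
import math


def gamma_map(p: float, gamma: float) -> float:
    """Power mapping: P(ngram) ** gamma (hypothetical helper name)."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability in [0, 1]")
    return p ** gamma


def log_map(p: float, tau: float) -> float:
    """Log mapping: log(1 + tau * P(ngram)) / log(1 + tau) (hypothetical helper name)."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability in [0, 1]")
    if tau <= 0.0:
        raise ValueError("tau must be positive")
    # log1p(x) computes log(1 + x) accurately for small x.
    return math.log1p(tau * p) / math.log1p(tau)


# Illustrative values only; gamma and tau here are not taken from the source.
p = 0.01
print(gamma_map(p, 0.5))
print(log_map(p, 100.0))
```

Both mappings send 0 to 0 and 1 to 1. With γ < 1 or τ > 0, they raise small probabilities relative to large ones, which compresses the range of n-gram scores.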