Language Identification With Confidence Limits

  • Published 2002


A statistical classification algorithm and its application to language identification from noisy input are described. The main innovation is to compute confidence limits on the classification, so that the algorithm terminates once enough evidence has been accumulated to make a clear decision, thereby avoiding problems with categories that have similar characteristics. A second application, to genre identification, is briefly examined. The results show that some of the problems of other language identification techniques can be avoided, and illustrate a more important point: that a statistical language process can be used to provide feedback about its own success rate.

1 Introduction

Language identification is an example of a general class of problems in which we want to assign an input data stream to one of several categories as quickly and accurately as possible. It can be solved using many techniques, including knowledge-poor statistical approaches. Typically, the distribution of n-grams of characters or other objects is used to form a model, and a comparison of the input against the model determines the language which matches best. Versions of this simple technique can be found in Dunning (1994) and Cavnar and Trenkle (1994), while an interesting practical implementation is described by Adams and Resnik (1997).

A variant of the problem is considered by Sibun and Spitz (1994) and Sibun and Reynar (1996), who look at it from the point of view of Optical Character Recognition (OCR). Here, the language model for the OCR system cannot be selected until the language has been identified. They therefore work with so-called shape tokens, which give a very approximate encoding of the characters' shapes on the printed page without needing full-scale OCR. For example, all upper-case letters are treated as one character shape, all characters with a descender are another, and so on.
Sequences of character shape codes separated by white space are assembled into word shape tokens. Sibun and Spitz then determine the language on the basis of linear discriminant analysis (LDA) over word shape tokens, while Sibun and Reynar explore the use of entropy relative to training data for character shape unigrams, bigrams and trigrams. Both techniques are capable of over 90% accuracy for most languages. However, the LDA-based technique tends to perform significantly worse for languages which are similar to one another, such as the Norse languages. Relative entropy performs better, but still has some noticeable error clusters, such as confusion between Croatian, Serbian and Slovenian.

What these techniques lack is a measure of when enough information has been accumulated to distinguish one language from another reliably: they examine all of the input data and then make the decision. Here we will look at a different approach which attempts to overcome this by maintaining a measure of the total evidence accumulated for each language, and of how much confidence there is in that measure. To outline the approach:

1. The input is processed one (word shape) token at a time. For each language, we determine the probability that the token is in that language, expressed as a 95% confidence range.

2. The values for each word are accumulated into an overall score with a confidence range for the input to date, and compared both to an absolute threshold, and with
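A minimal sketch of this accumulate-until-confident scheme, assuming per-token log-probability models, a normal approximation for the 95% range, and a stopping rule that fires when one language's lower bound clears every rival's upper bound. The model structure, the floor score for unseen tokens, and the minimum-token guard are all illustrative assumptions, not the paper's exact formulation:

```python
import math

def confidence_range(scores):
    """95% range on the mean per-token log-probability (normal approx.)."""
    n = len(scores)
    mean = sum(scores) / n
    var = sum((s - mean) ** 2 for s in scores) / max(n - 1, 1)
    half = 1.96 * math.sqrt(var / n)
    return mean - half, mean + half

def identify(tokens, models, min_tokens=5):
    """Process tokens one at a time and stop as soon as one language's
    lower confidence bound exceeds every other language's upper bound.
    `models` maps language -> {token: log-probability} (assumed layout)."""
    seen = {lang: [] for lang in models}
    for i, tok in enumerate(tokens, 1):
        for lang, model in models.items():
            seen[lang].append(model.get(tok, -10.0))  # assumed unseen-token floor
        if i >= min_tokens:
            ranges = {lang: confidence_range(s) for lang, s in seen.items()}
            best = max(ranges, key=lambda l: ranges[l][0])
            if all(ranges[best][0] > hi
                   for l, (lo, hi) in ranges.items() if l != best):
                return best  # intervals no longer overlap: a clear decision
    return max(seen, key=lambda l: sum(seen[l]))  # fall back to best total
```

The early-exit condition is what distinguishes this from the whole-input techniques above: similar languages simply keep the intervals overlapping for longer, so more input is consumed before a decision is committed.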

