Automatic Language Identification for Persian and Dari texts

Shervin Malmasi
We present the first empirical study of distinguishing Persian and Dari texts at the sentence level using discriminative models. As Dari is a low-resourced language, we developed a corpus of 28k sentences (14k per language) for this task. Using character and word n-grams as features, we discriminate the two languages with 96% accuracy using a classifier ensemble. Out-of-domain cross-corpus evaluation was conducted to test the discriminative models’ generalizability, achieving 87% accuracy in classifying 79k…
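The abstract names character and word n-gram features combined through a classifier ensemble, but gives no implementation details. The following is a minimal sketch of that general idea, not the paper's method: it substitutes a small multinomial Naive Bayes classifier (a generative model, chosen only because it fits in a few lines of standard-library Python) for the discriminative models the paper uses, and it assumes a mean-probability fusion rule. The names `char_ngrams`, `word_ngrams`, `NaiveBayes`, and `ensemble_predict`, the labels `lang_a` and `lang_b`, and the toy strings are all hypothetical placeholders, not Persian or Dari data.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n):
    """Return all overlapping character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def word_ngrams(text, n):
    """Return all overlapping word n-grams of a whitespace-tokenised string."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing over one feature type.

    Assumption: a stand-in for the paper's (unspecified) discriminative models.
    """

    def __init__(self, extract):
        self.extract = extract              # function: text -> list of features
        self.counts = defaultdict(Counter)  # label -> feature counts
        self.docs = Counter()               # label -> number of training texts
        self.vocab = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            feats = self.extract(text)
            self.counts[label].update(feats)
            self.docs[label] += 1
            self.vocab.update(feats)
        return self

    def predict_proba(self, text):
        total_docs = sum(self.docs.values())
        scores = {}
        for label in self.docs:
            total = sum(self.counts[label].values())
            score = math.log(self.docs[label] / total_docs)
            for feat in self.extract(text):
                smoothed = self.counts[label][feat] + 1
                score += math.log(smoothed / (total + len(self.vocab)))
            scores[label] = score
        # Normalise log scores into probabilities (softmax with max shift).
        top = max(scores.values())
        exp = {label: math.exp(s - top) for label, s in scores.items()}
        norm = sum(exp.values())
        return {label: value / norm for label, value in exp.items()}


def ensemble_predict(models, text):
    """Average the per-model probabilities and return the top label.

    Assumption: mean-probability fusion; the abstract does not state the rule.
    """
    avg = Counter()
    for model in models:
        for label, prob in model.predict_proba(text).items():
            avg[label] += prob / len(models)
    return max(avg, key=avg.get)


# Toy placeholder data (hypothetical, not real Persian or Dari sentences).
train_texts = ["aba abba baab", "abab aab ba", "xyx yxxy xyy", "yxy xxy yx"]
train_labels = ["lang_a", "lang_a", "lang_b", "lang_b"]

# One ensemble member per feature type: character bigrams, character
# trigrams, and word unigrams.
models = [
    NaiveBayes(lambda t: char_ngrams(t, 2)).fit(train_texts, train_labels),
    NaiveBayes(lambda t: char_ngrams(t, 3)).fit(train_texts, train_labels),
    NaiveBayes(lambda t: word_ngrams(t, 1)).fit(train_texts, train_labels),
]

prediction = ensemble_predict(models, "abba ab")
```

Training one member per feature type keeps each feature space separate, so a sentence with no known words can still be classified from its character n-grams. On real data the members would be trained on the labelled sentences and evaluated on held-out or out-of-domain text.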
This paper has 29 citations.



