Monothetic Separation of Telugu, Hindi and English Text Lines From a Multilingual

Abstract

In a multi-script, multilingual environment, a document may contain text lines in more than one script or language. The different script regions of such a document must be identified before each region can be passed to the OCR system for its language. In this context, this paper proposes a monothetic algorithmic model that identifies and separates text lines of Telugu, Hindi and English scripts in a printed multilingual document. The proposed method uses the distinct features of each target script and searches for the text lines that possess the anticipated features. The experiments used 1500 text lines for learning and 900 text lines for testing. The reported performance is 98.5%.
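The idea of a monothetic search, where each decision rests on a single script-specific feature, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not list the features used, so the two predicates below (a near-solid headline row for Hindi, near-solid vertical strokes for English), the thresholds, and all function names are assumptions chosen only for illustration. Text lines are assumed to be already segmented into small binary images.

```python
from typing import Callable, List, Sequence, Tuple

# One text line as a binary image: 1 = ink, 0 = background (assumed input format).
Line = Sequence[Sequence[int]]


def row_density(line: Line) -> List[float]:
    """Fraction of ink pixels in each pixel row (horizontal projection profile)."""
    return [sum(row) / len(row) for row in line]


def column_density(line: Line) -> List[float]:
    """Fraction of ink pixels in each pixel column (vertical projection profile)."""
    height = len(line)
    return [sum(row[c] for row in line) / height for c in range(len(line[0]))]


def has_headline(line: Line, threshold: float = 0.8) -> bool:
    """Hypothetical Hindi feature: a nearly solid row in the upper half of the line."""
    profile = row_density(line)
    upper = profile[: max(1, len(profile) // 2)]
    return max(upper) >= threshold


def has_vertical_stroke(line: Line, threshold: float = 0.8) -> bool:
    """Hypothetical English feature: a nearly solid column (a tall vertical stroke)."""
    return max(column_density(line)) >= threshold


def separate_lines(
    lines: Sequence[Line],
    tests: Sequence[Tuple[str, Callable[[Line], bool]]],
    fallback: str = "unclassified",
) -> List[str]:
    """Label each line with the first script whose single-feature test it passes."""
    labels = []
    for line in lines:
        label = next((name for name, test in tests if test(line)), fallback)
        labels.append(label)
    return labels


# Order matters: each test looks at one feature only, so the headline is checked first.
TESTS = [("Hindi", has_headline), ("English", has_vertical_stroke)]

# Toy 4x6 images, invented for this example.
hindi_like = [[0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1], [0, 1, 0, 0, 1, 0], [0, 1, 0, 0, 1, 0]]
english_like = [[1, 0, 0, 1, 0, 0], [1, 0, 0, 1, 0, 0], [1, 0, 0, 1, 1, 0], [1, 0, 0, 1, 0, 0]]
rounded = [[0, 1, 1, 0, 0, 0], [1, 0, 0, 1, 0, 0], [1, 0, 0, 1, 0, 0], [0, 1, 1, 0, 0, 0]]

labels = separate_lines([hindi_like, english_like, rounded], TESTS)
print(labels)
```

In this sketch a line that passes neither test is left as "unclassified" rather than assigned to a third script, since the feature that would identify it is not given in the abstract.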

DOI: 10.1109/ICSMC.2009.5346045

5 Figures and Tables

Cite this paper

@inproceedings{Padma2009MonotheticSO,
  title={Monothetic Separation of Telugu, Hindi and English Text Lines From a Multilingual},
  author={M. C. Padma and P. A. Vijaya},
  booktitle={SMC},
  year={2009}
}