A Study to Recognize Printed Gujarati Characters Using Tesseract OCR

  • Milind Kumar Audichya
  • Published 30 September 2017
  • Computer Science
  • International Journal for Research in Applied Science and Engineering Technology
Optical Character Recognition (OCR) is a widely known technique for recognizing printed text using a computer with the help of various peripheral devices. OCR research is in progress for the scripts of many languages, while for many others it is still far off. Gujarati is one of the least studied scripts in OCR research compared to other scripts. Tesseract, a well-known open-source OCR engine that is already used to recognize numerous scripts, can be used to…


Implementation of Words and Characters Segmentation of Gujarati Script Using MATLAB
A novel algorithm is proposed that takes input from a text file containing Gujarati script and segments it into words and characters; it is implemented in MATLAB and validated on 10 numbers written in words.
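Ekstraksi Karakter Citra Menggunakan Optical Character Recognition Untuk Pencetakan Nomor Kendaraan Pada Struk Parkir (Image Character Extraction Using Optical Character Recognition for Printing Vehicle Numbers on Parking Receipts)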
The license plate number can be printed on the parking receipt by extracting its characters from the vehicle image, which is generally acquired at the parking entrance portal, using Optical Character Recognition with the Tesseract library.
Deep Learning Approach for Spoken Digit Recognition in Gujarati Language
This research paper seeks to recognize the ten spoken Gujarati digits, zero to nine, using a deep learning approach; a maximum accuracy of 98.7% is achieved.
Shot Boundary Detection for Gujarati News Video
This paper presents an efficient video shot boundary detection method based on visual information, which uses histogram difference and rank to determine shot boundaries.


An OCR for separation and identification of mixed English — Gujarati digits using kNN classifier
  • S. Chaudhari, R. Gulati
  • Computer Science
    2013 International Conference on Intelligent Systems and Signal Processing (ISSP)
  • 2013
An OCR system that separates and identifies mixed English-Gujarati digits, giving an average accuracy of 99.26% for Gujarati digits, 99.20% for English digits, and 99.23% overall.
Classification of offline Gujarati handwritten characters
A new combination of structural and statistical methods (Freeman chain code, Hu's invariant moments, and center of mass) is used to extract feature vectors and yields good accuracy.
Zone identification in the printed Gujarati text
A sophisticated method for accurate zone detection in images of printed Gujarati text is proposed; this approach is expected to ease the design and development of Gujarati OCR systems for complete character sets.
Optical Character Recognition by Open source OCR Tool Tesseract: A Case Study
A comparative study of this tool against the commercial OCR tool Transym OCR, using vehicle number plates as input; the two tools are compared on various parameters.
Integrating Bangla script recognition support in tesseract OCR
This paper presents a complete methodology to integrate Bangla script recognition support in Tesseract, and shows how this support can be integrated into existing OCR engines.
Gujarati Handwritten Character Recognition Using Hybrid Method Based On Binary Tree-Classifier And K-Nearest Neighbour
A hybrid approach based on a tree classifier and k-Nearest Neighbor is used to recognize handwritten Gujarati characters; the achieved success rate of 63% is considered acceptable, as this is one of the few attempts to recognize the whole set of Gujarati handwritten characters.
Shirorekha Chopping Integrated Tesseract OCR Engine for Enhanced Hindi Language Recognition
This paper presents a complete methodology to improve Hindi language recognition accuracy, and presents a comparison with other available Devanagari OCR engines on the basis of recognition accuracy, processing time, font variations, and database size.
An Overview of the Tesseract OCR Engine
  • R. Smith
  • Computer Science
    Ninth International Conference on Document Analysis and Recognition (ICDAR 2007)
  • 2007
The Tesseract OCR engine, as was the HP Research Prototype in the UNLV Fourth Annual Test of OCR Accuracy, is described in a comprehensive overview. Emphasis is placed on aspects that are novel or at
How to train Tesseract 3.01 - Cédric Verstraeten
  • 16 02 2017. [Online]. Available: https://blog.cedric.ws/how-to-train-tesseract-301.
  • 2017
Gujarati language - Wikipedia