Corpus ID: 201645138

A BLSTM Network for Printed Bengali OCR System with High Accuracy

Authors: Debabrata Paul and Bidyut B. Chaudhuri
This paper presents a printed Bengali and English text OCR system built on a BLSTM-CTC architecture with a single hidden layer of 128 units. We did not use peephole connections or dropout in the BLSTM, which helped us achieve better accuracy. The architecture was trained on 47,720 text lines, which also include English words. When tested on 20 different Bengali fonts, it produced a character-level accuracy of 99.32% and a word-level accuracy of 96.65%. A good Indic multi…
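As a rough illustration of the CTC output stage named in the abstract: a CTC-trained network emits one score vector per time step over the character labels plus a blank label, and the simplest way to read out text is best-path decoding (take the top label per frame, merge consecutive repeats, drop blanks). The sketch below is not the authors' implementation; the function name `ctc_greedy_decode`, the blank index 0, and the toy scores are assumptions made for illustration.

```python
from itertools import groupby

BLANK = 0  # assumed index of the CTC blank label (hypothetical choice)


def ctc_greedy_decode(frame_scores, blank=BLANK):
    """Best-path CTC decode: argmax per frame, merge repeats, drop blanks."""
    # Pick the highest-scoring label index at each time step.
    best_path = [
        max(range(len(scores)), key=scores.__getitem__)
        for scores in frame_scores
    ]
    # Collapse runs of identical labels, then remove the blank label.
    return [label for label, _ in groupby(best_path) if label != blank]


# Toy example: 3 labels (0 = blank), 6 time steps.
toy_scores = [
    [0.1, 0.8, 0.1],    # label 1
    [0.1, 0.7, 0.2],    # label 1 (repeat, merged with the previous frame)
    [0.9, 0.05, 0.05],  # blank separates the two occurrences of label 1
    [0.2, 0.7, 0.1],    # label 1
    [0.1, 0.2, 0.7],    # label 2
    [0.1, 0.1, 0.8],    # label 2 (repeat, merged)
]
decoded = ctc_greedy_decode(toy_scores)
```

The blank frame is what lets the decoder output the same character twice in a row; without it, the two runs of label 1 would merge into one.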
An OCR for Classical Indic Documents Containing Arbitrarily Long Words
A Sanskrit specific OCR system for printed classical Indic documents written in Sanskrit is developed, and an attention-based LSTM model for reading Sanskrit characters in line images is presented, setting the stage for application of OCRs on large corpora of classic Sanskrit texts containing arbitrarily long and highly conjoined words.
Development of a New Image-to-text Conversion System for Pashto, Farsi and Traditional Chinese
The target audience of this paper is a general audience with interest in Digital Humanities or in retrieval of accurate full-text and metadata from digital images.
Confronting the Constraints for Optical Character Segmentation from Printed Bangla Text Image
The proposed algorithm is able to segment characters both from ideal and non-ideal cases of scanned or captured images giving a sustainable outcome.
Constraints in Developing a Complete Bengali Optical Character Recognition System
The aim of this research is to analyze the challenges prevalent in developing a Bengali OCR system through robust literature review and implementation, and suggest some possible solutions related to it.


Printed text recognition using BLSTM and MDLSTM for Indian languages
The validation results show that MDLSTM outperforms both BLSTM and Tesseract for all the languages included in the experiments; the level and number of hidden layers in both architectures are empirically selected and kept the same for all the languages.
Offline Printed Urdu Nastaleeq Script Recognition with Bidirectional LSTM Networks
This work presents the results of applying RNNs to printed Urdu text in Nastaleeq script and evaluates BLSTM networks in two cases: one ignoring the characters' shape variations and one considering them.
High-Performance OCR for Printed English and Fraktur Using LSTM Networks
Bidirectional LSTM networks are applied to the problem of machine-printed Latin and Fraktur recognition; the reported recognition accuracies were obtained without using any language modelling or other post-processing techniques.
Towards a Robust OCR System for Indic Scripts
A web-based OCR system is proposed that follows a unified architecture for seven Indian languages, is robust against popular degradations, follows a segmentation-free approach, addresses the UNICODE re-ordering issues, and can enable continuous learning with user inputs and feedback.
Recognition of printed Devanagari text using BLSTM Neural Network
This paper proposes a recognition scheme for the Indian script of Devanagari using a Recurrent Neural Network known as Bidirectional Long Short Term Memory (BLSTM) and reports a reduction of more than 20% in word error rate and over 9% reduction in character error rate while comparing with the best available OCR system.
A Hybrid Deep Architecture for Robust Recognition of Text Lines of Degraded Printed Documents
The contributions made in the present study are the creation of a moderately large annotated database of degraded Bangla documents towards their recognition studies, the development of a Gaussian mixture model based strategy for extraction of text components from the complex noisy background of such documents, and the development of a line level recognition scheme for degraded Bangla documents.
Text recognition using deep BLSTM networks
A Deep Bidirectional Long Short Term Memory (LSTM) based Recurrent Neural Network architecture for text recognition that uses Connectionist Temporal Classification (CTC) for training to learn the labels of an unsegmented sequence with unknown alignment.
A segmentation-free approach for printed Devanagari script recognition
Results are reported for applying LSTM networks to Devanagari script, where consonant-consonant conjuncts and consonant-vowel combinations take different forms based on their position in the word.
Multilingual OCR for Indic Scripts
An end-to-end RNN based architecture which can detect the script and recognize the text in a segmentation-free manner is proposed for this purpose and demonstrated for 12 Indian languages and English.
A complete printed Bangla OCR system
A complete Optical Character Recognition (OCR) system for printed Bangla, the fourth most popular script in the world, is presented and extension of the work to Devnagari, the third most popular script in the world, is discussed.