ABSTRACT
English words frequently appear in Gurmukhi texts. A monolingual Gurmukhi OCR system outputs garbage for such words, so the system needs bilingual capability to recognize English text as well. Adding bilingual capability, however, reduces recognition accuracy on monolingual text, because words whose script is misidentified are misrecognized. Even a system with 99% script identification accuracy loses about 1% recognition accuracy on monolingual text. In this paper, we present a bilingual OCR system that recognizes both English and Gurmukhi scripts without any significant loss of recognition accuracy, relative to the monolingual Gurmukhi OCR, on monolingual Gurmukhi text. This is achieved by using multiple script identification engines and language models for both English and Gurmukhi scripts. This is the first such system to recognize, with high accuracy, document images containing mixed Gurmukhi and English text as well as purely Gurmukhi or purely English text.
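The 1% figure can be checked with a short worked example. The sketch below rests on assumptions that the abstract does not state: every word passes through script identification before recognition, and a word routed to the wrong-script recognizer is always misrecognized. The function names are hypothetical and are not taken from the paper.

```python
def bilingual_accuracy(script_id_accuracy: float, monolingual_accuracy: float) -> float:
    """Expected word accuracy on monolingual text behind a script-identification step.

    Assumption (not from the paper): a word sent to the wrong-script
    recognizer is always wrong, so only correctly routed words can be
    recognized, and they are recognized at the monolingual rate.
    """
    for value in (script_id_accuracy, monolingual_accuracy):
        if not 0.0 <= value <= 1.0:
            raise ValueError("accuracies must lie in [0, 1]")
    return script_id_accuracy * monolingual_accuracy


def accuracy_drop(script_id_accuracy: float, monolingual_accuracy: float) -> float:
    """Accuracy lost on monolingual text relative to the monolingual OCR."""
    return monolingual_accuracy - bilingual_accuracy(
        script_id_accuracy, monolingual_accuracy
    )


if __name__ == "__main__":
    # 99% script identification in front of a hypothetical perfect recognizer.
    print(accuracy_drop(0.99, 1.0))
    # The same script identifier in front of a hypothetical 97% recognizer.
    print(accuracy_drop(0.99, 0.97))
```

Under these assumptions the loss equals the script identification error rate times the monolingual accuracy: 0.01 for a perfect recognizer, and about 0.0097 for a 97% recognizer. The 97% value is only an illustrative input, not a result reported by the authors.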