DOI: 10.1145/2505377.2505381
research-article

A bilingual Gurmukhi-English OCR based on multiple script identifiers and language models

Published: 24 August 2013

ABSTRACT

English words are frequently encountered in Gurmukhi texts, and a monolingual Gurmukhi OCR recognizes such words as garbage. It therefore becomes necessary to add bilingual capability to the Gurmukhi OCR so that it can recognize English text too. However, adding bilingual capability reduces the recognition accuracy for monolingual texts because of errors in script identification: even a system with 99% script identification accuracy loses about 1% recognition accuracy on monolingual text. In this paper, we present a bilingual OCR that recognizes both English and Gurmukhi scripts without any significant reduction in recognition accuracy on monolingual Gurmukhi text, as compared to the monolingual Gurmukhi OCR. This is achieved by using multiple script identification engines and language models for both English and Gurmukhi scripts. This is the first system of its kind to recognize, with high accuracy, document images containing mixed Gurmukhi and English text as well as Gurmukhi-only or English-only text.
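The arithmetic behind the 99% figure can be sketched as follows. This is a minimal illustration, not the authors' method: it assumes every word routed to the wrong script recognizer is counted as an error, and the monolingual accuracy of 0.95 is a hypothetical value. The second function shows one way several identifiers could help, under the further assumption (not stated in the abstract) that their errors are independent and that a switch to English requires unanimous agreement.

```python
def bilingual_accuracy(recognizer_acc: float, script_id_acc: float) -> float:
    """Word accuracy on monolingual text when each misrouted word is an error.

    Assumption: a word sent to the wrong script's recognizer is always wrong.
    """
    return recognizer_acc * script_id_acc


def unanimous_false_switch_rate(per_identifier_error: float, n_identifiers: int) -> float:
    """Chance that all identifiers wrongly agree on the other script.

    Assumption: identifier errors are independent, which real engines
    sharing the same image features may not satisfy.
    """
    return per_identifier_error ** n_identifiers


mono_acc = 0.95  # hypothetical monolingual Gurmukhi OCR accuracy
loss = mono_acc - bilingual_accuracy(mono_acc, script_id_acc=0.99)
print(f"accuracy lost to script identification: {loss:.4f}")

# Three hypothetical identifiers, each wrong on 1% of Gurmukhi words.
print(f"unanimous false switch rate: {unanimous_false_switch_rate(0.01, 3):.2e}")
```

With these assumed numbers the loss is roughly one percentage point, which matches the order of magnitude quoted in the abstract; the actual combination rule used in the paper is not described here.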

  • Published in

    ACM Other conferences
    MOCR '13: Proceedings of the 4th International Workshop on Multilingual OCR
    August 2013
    99 pages
    ISBN: 9781450321143
    DOI: 10.1145/2505377

    Copyright © 2013 ACM

    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Acceptance Rates

    MOCR '13 paper acceptance rate: 17 of 34 submissions, 50%. Overall acceptance rate: 17 of 34 submissions, 50%.
