research-article

A bilingual Gurmukhi-English OCR based on multiple script identifiers and language models

Author:
Gurpreet Singh Lehal

Punjabi University, Patiala, Punjab, India


MOCR '13: Proceedings of the 4th International Workshop on Multilingual OCR
August 2013, Article No. 3, Pages 1–5
https://doi.org/10.1145/2505377.2505381

Published: 24 August 2013

ABSTRACT

English words are frequently encountered in Gurmukhi texts. A monolingual Gurmukhi OCR recognizes such words as garbage, so the Gurmukhi OCR needs bilingual capability to recognize English text as well. Adding that capability, however, reduces recognition accuracy on monolingual text because of errors in script identification: even a system with 99% script identification accuracy loses about 1% recognition accuracy on monolingual text. In this paper, we present a bilingual OCR that recognizes both English and Gurmukhi scripts without any significant reduction in recognition accuracy, relative to the monolingual Gurmukhi OCR, when recognizing monolingual Gurmukhi text. This is achieved by using multiple script identification engines and language models for both English and Gurmukhi scripts. This is the first such system to recognize, with high accuracy, document images containing mixed Gurmukhi and English text as well as images containing only Gurmukhi or only English text.
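The accuracy argument in the abstract can be illustrated with a short calculation. The sketch below is not the paper's method. It assumes that a word routed to the wrong recognizer is always misrecognized, that the script identifiers make independent errors, and that they are combined by a simple majority vote (the abstract does not say how the engines are combined). The 97% monolingual baseline and the function names are hypothetical and chosen only for illustration.

```python
from math import comb


def bilingual_accuracy(script_id_accuracy: float, ocr_accuracy: float) -> float:
    """Expected word accuracy on monolingual text behind a script identifier.

    Assumption: a word sent to the wrong recognizer is always wrong, so only
    correctly routed words can be recognized correctly.
    """
    return script_id_accuracy * ocr_accuracy


def majority_vote_accuracy(single_accuracy: float, n_identifiers: int) -> float:
    """Accuracy of a majority vote over n independent script identifiers.

    Assumptions: n is odd, every identifier has the same accuracy, and their
    errors are statistically independent (rarely true in practice).
    """
    if n_identifiers < 1 or n_identifiers % 2 == 0:
        raise ValueError("n_identifiers must be a positive odd number")
    p = single_accuracy
    majority = n_identifiers // 2 + 1
    # Sum the binomial probabilities of at least `majority` correct votes.
    return sum(
        comb(n_identifiers, k) * p**k * (1 - p) ** (n_identifiers - k)
        for k in range(majority, n_identifiers + 1)
    )


if __name__ == "__main__":
    baseline = 0.97  # hypothetical monolingual Gurmukhi OCR word accuracy

    one_engine = bilingual_accuracy(0.99, baseline)
    print(f"one identifier at 99%: {one_engine:.4f}")  # 0.9603

    voted = majority_vote_accuracy(0.99, 3)
    three_engines = bilingual_accuracy(voted, baseline)
    print(f"vote of three identifiers: {voted:.6f}")  # 0.999702
    print(f"accuracy after voting: {three_engines:.4f}")  # 0.9697
```

Under these assumptions a single 99% identifier costs roughly one percentage point on monolingual text, while a vote of three independent identifiers recovers almost all of it. Real identifiers tend to fail on the same difficult words, so the actual gain would be smaller.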

Published in
MOCR '13: Proceedings of the 4th International Workshop on Multilingual OCR
August 2013
99 pages
ISBN:9781450321143
DOI:10.1145/2505377
General Chairs: Venu Govindaraju (University at Buffalo), Prem Natarajan (Information Sciences Institute), Santanu Chaudhury (IIT Delhi, India), Daniel Lopresti (Lehigh University)
Program Chairs: Srirangaraj Setlur (University at Buffalo), Huaigu Cao (Raytheon BBN Technologies)
Copyright © 2013 ACM
Publisher: Association for Computing Machinery, New York, NY, United States
Acceptance Rates
MOCR '13 paper acceptance rate: 17 of 34 submissions, 50%
Overall acceptance rate: 17 of 34 submissions, 50%