Elsevier

Pattern Recognition Letters

Volume 29, Issue 9, 1 July 2008, Pages 1218-1229

Word level multi-script identification

https://doi.org/10.1016/j.patrec.2008.01.027

Abstract

We report an algorithm to identify the script of each word in a document image. We start with a bi-script scenario which is later extended to tri-script and then to eleven-script scenarios. A database of 20,000 words of different font styles and sizes has been collected and used for each script. Effectiveness of Gabor and discrete cosine transform (DCT) features has been independently evaluated using nearest neighbor, linear discriminant and support vector machines (SVM) classifiers. The combination of Gabor features with nearest neighbor or SVM classifier shows promising results; i.e., over 98% for bi-script and tri-script cases and above 89% for the eleven-script scenario.
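As a rough illustration of the Gabor plus nearest neighbor combination named above, the following is a minimal sketch and not the authors' implementation. It assumes a small bank of even-symmetric (cosine) Gabor kernels, uses the mean squared filter response as the feature for each kernel, and classifies with a 1-nearest-neighbor rule on Euclidean distance. The kernel size, wavelengths, number of orientations, the choice sigma = 0.5 × wavelength, the helper names and the striped toy images (stand-ins for word images) are all assumptions made for this example.

```python
import math

def gabor_kernel(size, wavelength, theta, sigma):
    """Even-symmetric (cosine) Gabor kernel as a size x size list of lists; size must be odd."""
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            # Rotate coordinates so the sinusoid varies along direction theta.
            xr = x * math.cos(theta) + y * math.sin(theta)
            envelope = math.exp(-(x * x + y * y) / (2.0 * sigma * sigma))
            row.append(envelope * math.cos(2.0 * math.pi * xr / wavelength))
        kernel.append(row)
    # Subtract the mean so the filter does not respond to constant brightness.
    mean = sum(sum(row) for row in kernel) / (size * size)
    return [[v - mean for v in row] for row in kernel]

def filter_energy(image, kernel):
    """Mean squared filter response over all positions where the kernel fits."""
    k = len(kernel)
    h, w = len(image), len(image[0])
    total, count = 0.0, 0
    for i in range(h - k + 1):
        for j in range(w - k + 1):
            r = sum(kernel[a][b] * image[i + a][j + b]
                    for a in range(k) for b in range(k))
            total += r * r
            count += 1
    return total / count

def gabor_features(image, wavelengths=(4.0, 8.0), n_orient=4, size=7):
    """One energy value per (wavelength, orientation) pair; parameters are assumed."""
    feats = []
    for lam in wavelengths:
        for o in range(n_orient):
            theta = o * math.pi / n_orient
            kernel = gabor_kernel(size, lam, theta, sigma=0.5 * lam)
            feats.append(filter_energy(image, kernel))
    return feats

def nearest_neighbor(train, query):
    """train: list of (feature_vector, label); returns the label of the closest vector."""
    return min(train, key=lambda item: math.dist(item[0], query))[1]

def stripes(n, vertical, shift=0):
    """Toy n x n image with 2-pixel-wide stripes, a stand-in for a word image."""
    return [[float(((j if vertical else i) + shift) // 2 % 2) for j in range(n)]
            for i in range(n)]

# Hypothetical two-class training set: one example per class.
train = [
    (gabor_features(stripes(16, vertical=True)), "vertical_strokes"),
    (gabor_features(stripes(16, vertical=False)), "horizontal_strokes"),
]
# Query: the vertical pattern shifted by one pixel.
query = gabor_features(stripes(16, vertical=True, shift=1))
print(nearest_neighbor(train, query))
```

Energy features of this kind discard the phase of the response, which is why the shifted query still lands next to its own class. A realistic system would use word images, a larger training set, and possibly the odd-symmetric (sine) filter component as well.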

Introduction

Demand for tools capable of recognizing, searching and retrieving documents from multi-script and multi-lingual environments has increased manyfold in recent years. Thus, recognition of the script and language plays an important part in the automated processing and utilization of documents. Plenty of research has addressed script recognition at the paragraph/block level or the line level. While the former assumes that a full document page is of the same script, the latter assumes that documents contain text from multiple scripts, with the script changing only from line to line. Though the latter is a realistic assumption in some cases, in most practical situations the script changes from word to word. In Fig. 1a and b, we show two text images where the script changes at the word level. Fig. 1a shows a bi-script document, where the presence of interspersed English words in a document of Devanagari script is clearly seen. Similarly, Fig. 1b shows the variation of script at both the line and word level and the presence of three scripts in a document, which is a common occurrence in India.

Most optical character recognition (OCR) systems are designed using statistical pattern recognition techniques. It is generally observed that these systems generate good output for specific kinds of documents and when the number of classes is reasonable. Including all the symbols used for writing in the world together in one reference set, as different classes, would make the number of classes prohibitively high. Most of the Indian scripts have 13 vowels (V) and about 35 consonants (C). As reported in the review paper by Pati and Ramakrishnan (2005a), unlike in Roman script, in Indian scripts a consonant combines with another consonant or a vowel to generate a completely different symbol. This is demonstrated in Fig. 2, where two different symbols combine to generate a completely new symbol. Fig. 2a presents the combination of two consonants in Odiya script, while Fig. 2b presents a sample case in Devanagari. Thus the consonant–vowel (CV) and consonant–consonant–vowel (CCV) combinations, which appear frequently, generate a huge set of graphemes.
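To give a feel for the scale involved, the short calculation below counts the combinations that would arise if every V, C, CV and CCV unit were valid and visually distinct. This is an assumption made only to obtain an upper bound. In practice only a fraction of these combinations occur, and not every combination yields a new symbol. The function name is hypothetical, and the figure of 35 consonants is the approximate value quoted in the text.

```python
VOWELS = 13       # number of vowels quoted in the text
CONSONANTS = 35   # "about 35" consonants quoted in the text

def upper_bound_symbols(v, c):
    """Counts per unit type, assuming every combination is valid and distinct."""
    return {
        "V": v,
        "C": c,
        "CV": c * v,        # one consonant followed by one vowel
        "CCV": c * c * v,   # two consonants followed by one vowel
    }

counts = upper_bound_symbols(VOWELS, CONSONANTS)
total = sum(counts.values())
for unit, n in counts.items():
    print(unit, n)
print("total", total)
```

Even this crude bound runs to thousands of classes per script, which suggests why restricting the search to a single script before character classification is attractive.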

In Telugu script alone, Rajasekaran and Deekshatulu (1977) identified some 2000 symbols that are used regularly. Kannada, with a very similar script and rules, has a comparable number of symbols. Devanagari and similar scripts have close to 6000 such combinations each, and other scripts have around 300 symbols. Thus, script identification can act as a preliminary filter and reduce the complexity of the search when classifying a test pattern. Moreover, for scripts such as Devanagari and Bangla, identifying the script decides the further course of processing. This includes removal of the shirorekha, the headline, from the word to separate the different symbols forming the word so that each of them can be individually recognized. Thus, identification of the script is one of the necessary challenges for the designer of OCR systems dealing with multi-script documents.

Section snippets

Literature review

Quite a few results have been reported in the literature, identifying the scripts at the level of paragraphs or lines. We review this literature in the section below. However, very few research works deal with script identification at the word level, which we review in Section 2.2.

System description

In the present work, we explore the effectiveness of our approach (Pati and Ramakrishnan, 2006) in recognizing word level change of script for up to eleven scripts. We studied the structural properties of the scripts before designing an identifier for them. Since the scripts of Assamese and Bengali are nearly the same, we consider them as one script. An observation of these eleven Indian scripts reveals the following properties:

  • Bengali, Devanagari and Punjabi scripts have a shirorekha

Data description

Document images are scanned using: (i) Umax Astra 5400 and (ii) Hewlett-Packard Scanjet 2200c scanners at 300 dpi resolution and stored in 8-bit gray format. The images are scanned from magazines, newspapers, books and laser printed documents. Variations in printing style and sizes are ensured. Eleven different scripts are considered for this database, namely, Bengali (Bangla), Roman (English), Devanagari (Hindi/Marathi), Gujarati, Kannada, Malayalam, Odiya, Gurumukhi (Punjabi), Tamil, Telugu

Results and discussion

Most of the multi-lingual documents in India are bi-script in nature. So, all bi-script cases are handled first. Based on the encouraging results that we got in these experiments, we decided to extend the experiments to tri-script case as well. Since most of the official documents of national importance have three scripts, such an experiment is justified. Finally, we also explore the possibility of recognizing the script of a word without any prior information. This is a blind script

Conclusion

The combination of Gabor filter bank with either SVM or NN classifier handles the important issue of script recognition at the word level quite well. For most cases, NNC performs at par with SVM and they both outperform LDC. However, the actual performance is script dependent. For example, using the Gabor-NNC combination, the overall classification performance for the tri-script combination involving Kannada, Devanagari and English is 99.6%, whereas the average correct recognition is only 89.2%

References (36)

  • R. Muralishankar et al. Modification of pitch using DCT in the source domain. Speech Comm. (2004)
  • S.N.S. Rajasekaran et al. Recognition of printed Telugu characters. Comput. Graph. Image Process. (1977)
  • Ablavsky, V., Stevens, M.R., 2003. Automatic feature selection with applications to script identification of degraded...
  • C.J.C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining Knowl. Discov. (1998)
  • Chan, W., Sivaswamy, J., 1999. Local energy analysis for text script classification. In: Proc. Image Vision Comput.,...
  • Chaudhuri, S., Seth, R., 1999. Trainable script identification strategies for Indian languages. In: Proc. Internat....
  • Chaudhuri, A.R., Mandal, A.K., Chaudhuri, B.B., 2002. Page layout analyser for multilingual Indian documents. In: Proc....
  • J. Daugman. Uncertainty relation for resolution in space, spatial frequency and orientation optimized by two-dimensional visual cortical filters. J. Opt. Soc. Amer. A (1985)
  • D. Dhanya et al. Script identification in printed bilingual documents. Sadhana (2002)
  • D.J. Field. Relation between the statistics of natural images and the response properties of cortical cells. J. Opt. Soc. Amer. A (1987)
  • D. Gabor. Theory of communication. J. IEE (London) (1946)
  • Gllavata, J., Freisleben, B., 2005. Script recognition in images with complex backgrounds. In: Proc. Fifth IEEE...
  • J. Hochberg et al. Automatic script identification from document images using cluster based templates. IEEE Trans. PAMI (1997)
  • Jaeger, S., Ma, H., Doermann, D., 2005. Identifying script on word-level with informational confidence. In: Proc....
  • Joshi, G.D., Garg, S., Sivaswamy, J., 2000. Script identification from Indian documents. In: Seventh IAPR Workshop Doc....
  • U.-V. Koc et al. DCT based motion estimation. IEEE Trans. Image Process. (1998)
  • Ma, H., Doermann, D., 2003. Gabor filter based multi-class classifier for scanned document images. In: Proc. Internat....
  • Manthalkar, R., Biswas, P.K., 2002. An automatic script identification scheme for Indian languages. In: Proc. Eighth...