Word level multi-script identification

doi:10.1016/j.patrec.2008.01.027

Pattern Recognition Letters

Volume 29, Issue 9, 1 July 2008, Pages 1218-1229

https://doi.org/10.1016/j.patrec.2008.01.027

Abstract

We report an algorithm to identify the script of each word in a document image. We start with a bi-script scenario which is later extended to tri-script and then to eleven-script scenarios. A database of 20,000 words of different font styles and sizes has been collected and used for each script. Effectiveness of Gabor and discrete cosine transform (DCT) features has been independently evaluated using nearest neighbor, linear discriminant and support vector machines (SVM) classifiers. The combination of Gabor features with nearest neighbor or SVM classifier shows promising results; i.e., over 98% for bi-script and tri-script cases and above 89% for the eleven-script scenario.

Introduction

Demand for tools that can recognize, search and retrieve documents from multi-script and multi-lingual environments has increased manyfold in recent years. Thus, recognition of the script and language plays an important part in the automated processing and utilization of documents. Much research has been carried out on script recognition at the paragraph/block or line level. While the former assumes that a full document page is of the same script, the latter assumes that documents contain text from multiple scripts, with the script changing only at the level of the line. Though the latter is a realistic assumption in some cases, in most practical situations the script changes from word to word. In Fig. 1a and b, we show two text images where the script changes at the word level. Fig. 1a shows a bi-script document, where the presence of interspersed English words in a document of Devanagari script is clearly seen. Similarly, Fig. 1b shows the variation of script at both line and word level and the presence of three scripts in a document, which is a common occurrence in India.

Most optical character recognition (OCR) systems are designed using statistical pattern recognition techniques. It is generally observed that these systems generate good output for specific kinds of documents and when the number of classes is reasonable. If all the symbols used for writing in the world were included together in one reference set as different classes, the number of classes would be prohibitively high. Most of the Indian scripts have 13 vowels (V) and about 35 consonants (C). As reported in the review paper by Pati and Ramakrishnan (2005a), unlike in Roman script, in Indian scripts a consonant combines with another consonant or a vowel to generate a completely different symbol. This is demonstrated in Fig. 2, where two different symbols combine to generate a completely new symbol. Fig. 2a presents the combination of two consonants in Odiya script, while Fig. 2b presents a sample case in Devanagari. Thus the consonant–vowel (CV) and consonant–consonant–vowel (CCV) combinations, which appear frequently, generate a huge set of graphemes.
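To make the scale of the problem concrete, the short Python sketch below computes rough upper bounds from the vowel and consonant counts quoted above. It assumes, purely for illustration, that every consonant can combine with every other consonant and every vowel; real scripts use only a fraction of these combinations, so the numbers are ceilings rather than actual inventory sizes.

```python
from typing import Dict

# Counts quoted in the text for most Indian scripts.
VOWELS = 13
CONSONANTS = 35


def combination_counts(vowels: int, consonants: int) -> Dict[str, int]:
    """Upper bounds on symbol classes, assuming every combination is allowed."""
    return {
        "basic": vowels + consonants,             # isolated V and C symbols
        "CV": consonants * vowels,                # consonant-vowel pairs
        "CCV": consonants * consonants * vowels,  # consonant-consonant-vowel
    }


counts = combination_counts(VOWELS, CONSONANTS)
total = sum(counts.values())
print(counts, total)
```

Even under this crude assumption, the CCV term dominates the total, which is why a single reference set covering several scripts quickly becomes unmanageable.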

In Telugu script alone, Rajasekaran and Deekshatulu (1977) have identified some 2000 symbols which are used regularly. Kannada, with a very similar script and rules, has a comparable number of symbols. Devanagari and similar scripts have close to 6000 such combinations each and other scripts have around 300 symbols. Thus, script identification can act as a preliminary filter and reduce the complexity of the search for classifying a test pattern. Moreover, for scripts such as Devanagari and Bangla, identification of the script decides the further course of processing. This includes removal of the shirorekha, the headline, from the word to separate the different symbols forming the word so that each of them can be individually recognized. Thus, identification of the script is one of the necessary challenges for the designer of OCR systems when dealing with multi-script documents.

Section snippets

Literature review

Quite a few results have been reported in the literature, identifying the scripts at the level of paragraphs or lines. We review this literature in the section below. However, very few research works deal with script identification at the word level, which we review in Section 2.2.

System description

In the present work, we explore the effectiveness of our approach (Pati and Ramakrishnan, 2006) in recognizing word level change of script for up to eleven scripts. We studied the structural properties of the scripts before designing an identifier for these scripts. Since the scripts of Assamese and Bengali are nearly the same, we consider them as one script. An observation of these eleven Indian scripts reveals the following properties:

• Bengali, Devanagari and Punjabi scripts have a shirorekha
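As an illustration of how a structural property such as the shirorekha could be checked, the sketch below uses a horizontal projection profile on a binarised word image. This is a generic heuristic and not the method of the paper, which relies on Gabor and DCT features; the function name `has_headline` and the threshold `min_fill` are hypothetical.

```python
from typing import List


def has_headline(word: List[List[int]], min_fill: float = 0.8) -> bool:
    """Return True if some row in the upper half is almost entirely ink.

    `word` is a binarised word image (1 = ink, 0 = background).
    `min_fill` is an assumed threshold, not a value from the source.
    """
    if not word or not word[0]:
        return False
    width = len(word[0])
    upper_half = word[: max(1, len(word) // 2)]
    # Horizontal projection profile: fraction of ink pixels in each row.
    profile = [sum(row) / width for row in upper_half]
    return max(profile) >= min_fill


# Toy 4x6 "words": the first has a full top row, the second does not.
with_headline = [
    [1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 1, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 1, 1, 0, 1, 0],
]
without_headline = [
    [0, 1, 0, 0, 1, 0],
    [0, 1, 1, 0, 1, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 1, 1, 1, 1, 0],
]
print(has_headline(with_headline), has_headline(without_headline))
```

A real system would need to cope with skew, broken headlines and noise, which this toy check ignores.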

Data description

Document images are scanned using: (i) Umax Astra 5400 and (ii) Hewlett-Packard Scanjet 2200c scanners at 300 dpi resolution and stored in 8-bit gray format. The images are scanned from magazines, newspapers, books and laser printed documents. Variations in printing style and sizes are ensured. Eleven different scripts are considered for this database, namely, Bengali (Bangla), Roman (English), Devanagari (Hindi/Marathi), Gujarati, Kannada, Malayalam, Odiya, Gurumukhi (Punjabi), Tamil, Telugu

Results and discussion

Most of the multi-lingual documents in India are bi-script in nature. So, all bi-script cases are handled first. Based on the encouraging results that we got in these experiments, we decided to extend the experiments to tri-script case as well. Since most of the official documents of national importance have three scripts, such an experiment is justified. Finally, we also explore the possibility of recognizing the script of a word without any prior information. This is a blind script
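For a sense of how many experiments each scenario implies, the sketch below counts the possible script combinations for a set of eleven scripts. It assumes every unordered pair or triple is a separate case; the source states that all bi-script cases are handled, but whether every triple was tested is not stated here.

```python
from math import comb

NUM_SCRIPTS = 11  # size of the script set described in the source

bi_script_cases = comb(NUM_SCRIPTS, 2)   # unordered pairs of scripts
tri_script_cases = comb(NUM_SCRIPTS, 3)  # unordered triples of scripts
print(bi_script_cases, tri_script_cases)
```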

Conclusion

The combination of Gabor filter bank with either SVM or NN classifier handles the important issue of script recognition at the word level quite well. For most cases, NNC performs at par with SVM and they both outperform LDC. However, the actual performance is script dependent. For example, using the Gabor-NNC combination, the overall classification performance for the tri-script combination involving Kannada, Devanagari and English is 99.6%, whereas the average correct recognition is only 89.2%
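The Gabor and nearest neighbor pairing can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the kernel size, wavelengths, number of orientations, the choice of sigma, the energy feature and the striped toy images are all hypothetical stand-ins for real word images and the paper's actual filter bank.

```python
import math
from typing import List, Sequence, Tuple

Image = List[List[float]]


def gabor_kernel(half: int, wavelength: float, theta: float) -> Image:
    """Even-symmetric Gabor kernel: Gaussian envelope times an oriented cosine."""
    sigma = 0.5 * wavelength  # assumed envelope width, not from the paper
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            along = x * math.cos(theta) + y * math.sin(theta)
            envelope = math.exp(-(x * x + y * y) / (2.0 * sigma * sigma))
            row.append(envelope * math.cos(2.0 * math.pi * along / wavelength))
        kernel.append(row)
    # Remove the mean so that uniform regions produce no response.
    size = 2 * half + 1
    mean = sum(sum(row) for row in kernel) / (size * size)
    return [[value - mean for value in row] for row in kernel]


def filter_energy(image: Image, kernel: Image) -> float:
    """Mean squared filter response over positions where the kernel fits."""
    k = len(kernel)
    total, count = 0.0, 0
    for i in range(len(image) - k + 1):
        for j in range(len(image[0]) - k + 1):
            response = sum(kernel[a][b] * image[i + a][j + b]
                           for a in range(k) for b in range(k))
            total += response * response
            count += 1
    return total / count if count else 0.0


def gabor_features(image: Image, wavelengths: Sequence[float] = (4.0, 8.0),
                   orientations: int = 4, half: int = 3) -> List[float]:
    """One energy value per (wavelength, orientation) filter in the bank."""
    return [filter_energy(image, gabor_kernel(half, w, math.pi * o / orientations))
            for w in wavelengths for o in range(orientations)]


def nearest_neighbour(query: Sequence[float],
                      training: Sequence[Tuple[Sequence[float], str]]) -> str:
    """Label of the training vector closest to `query` (squared Euclidean)."""
    best = min(training,
               key=lambda item: sum((q - t) ** 2 for q, t in zip(query, item[0])))
    return best[1]


def stripes(vertical: bool, shift: int = 0, size: int = 12) -> Image:
    """Toy stand-in for a word image: stripes two pixels wide."""
    return [[1.0 if (((j if vertical else i) + shift) // 2) % 2 == 0 else 0.0
             for j in range(size)] for i in range(size)]


training = [(gabor_features(stripes(True)), "vertical"),
            (gabor_features(stripes(False)), "horizontal")]
predicted = nearest_neighbour(gabor_features(stripes(True, shift=1)), training)
print(predicted)
```

The energy of each filter response is used because it is insensitive to small shifts of the pattern, so the shifted toy image still lands closest to the training sample with the same stroke orientation.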

References (36)

R. Muralishankar et al., Modification of pitch using DCT in the source domain, Speech Comm. (2004)
S.N.S. Rajasekaran et al., Recognition of printed Telugu characters, Comput. Graph. Image Process. (1977)
Ablavsky, V., Stevens, M.R., 2003. Automatic feature selection with applications to script identification of degraded...
C.J.C. Burges, A tutorial on support vector machines for pattern recognition, Data Mining Knowl. Discov. (1998)
Chan, W., Sivaswamy, J., 1999. Local energy analysis for text script classification. In: Proc. Image Vision Comput.,...
Chaudhuri, S., Seth, R., 1999. Trainable script identification strategies for Indian languages. In: Proc. Internat....
Chaudhuri, A.R., Mandal, A.K., Chaudhuri, B.B., 2002. Page layout analyser for multilingual Indian documents. In: Proc....
J. Daugman, Uncertainty relation for resolution in space, spatial frequency and orientation optimized by two-dimensional visual cortical filters, J. Opt. Soc. Amer. A (1985)
D. Dhanya et al., Script identification in printed bilingual documents, Sadhana (2002)
D.J. Field, Relation between the statistics of natural images and the response properties of cortical cells, J. Opt. Soc. Amer. A (1987)
D. Gabor, Theory of communication, J. IEE (London) (1946)
Gllavata, J., Freisleben, B., 2005. Script recognition in images with complex backgrounds. In: Proc. Fifth IEEE...
J. Hochberg et al., Automatic script identification from document images using cluster based templates, IEEE Trans. PAMI (1997)
Jaeger, S., Ma, H., Doermann, D., 2005. Identifying script on word-level with informational confidence. In: Proc....
Joshi, G.D., Garg, S., Sivaswamy, J., 2000. Script identification from Indian documents. In: Seventh IAPR Workshop Doc....
U.-V. Koc et al., DCT based motion estimation, IEEE Trans. Image Process. (1998)
Ma, H., Doermann, D., 2003. Gabor filter based multi-class classifier for scanned document images. In: Proc. Internat....
Manthalkar, R., Biswas, P.K., 2002. An automatic script identification scheme for Indian languages. In: Proc. Eighth...
